Add metrics to reversetunnel connections #14027

Merged
merged 5 commits into from
Jul 6, 2022

Conversation

@dboslee dboslee commented Jun 30, 2022

These metrics make it easier to compare latency between
connections over local tunnels, peer proxies, and direct dials.

The time to establish a connection is measured from the top of the function until the connection is returned.
The time to first use a connection is measured from when the connection is established to when the first read returns.
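The first-use measurement described above could be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `timedConn`, its `report` callback (a stand-in for a histogram observation), and `firstReadReports` are hypothetical names, and an in-memory `net.Pipe` replaces a real tunnel.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// timedConn is a hypothetical wrapper (not the type from this PR) that
// reports how long an established connection waited for its first read.
type timedConn struct {
	net.Conn
	established time.Time           // when the connection was handed to the caller
	firstRead   sync.Once           // guards the one-time report
	report      func(time.Duration) // stand-in for a histogram observation
}

// Read delegates to the wrapped connection and reports the elapsed time
// exactly once, after the first read returns.
func (c *timedConn) Read(b []byte) (int, error) {
	n, err := c.Conn.Read(b)
	c.firstRead.Do(func() { c.report(time.Since(c.established)) })
	return n, err
}

// firstReadReports performs two reads over an in-memory pipe and returns
// how many times the report callback fired.
func firstReadReports() int {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	reports := 0
	conn := &timedConn{
		Conn:        client,
		established: time.Now(),
		report:      func(time.Duration) { reports++ },
	}

	// Write two bytes from the other end; each Read below consumes one.
	go server.Write([]byte("hi"))
	buf := make([]byte, 1)
	conn.Read(buf)
	conn.Read(buf)
	return reports
}

func main() {
	fmt.Println("first-read reports:", firstReadReports())
}
```

Using `sync.Once` keeps the report to a single observation per connection even if `Read` is later called from several goroutines.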

Implements part of https://github.com/gravitational/cloud/issues/1823

Co-authored-by: Vitor Enes <vitor.duarte@goteleport.com>
@dboslee dboslee force-pushed the david/conn-metrics branch from 9037af4 to 3c7dd91 Compare June 30, 2022 21:22
@dboslee dboslee added observability Used for metrics and insight into Teleport. backport/branch/v10 proxy-peering labels Jun 30, 2022
@dboslee dboslee marked this pull request as ready for review June 30, 2022 21:27
@github-actions github-actions bot requested a review from marcoandredinis June 30, 2022 21:28

// start is the time since the last state was reported.
start time.Time
firstUse sync.Once

Nit: we could maybe rename all instances of "first use" to "first read", or something else that better represents a connection state (the histogram label) like "established".

+1 for first read
You could also pick another variation and add the Once suffix, which seems common in the codebase.

@vitorenesduarte vitorenesduarte self-assigned this Jul 1, 2022
@espadolini

Couldn't this be calculated from tracing data instead? Do we need a separate metric for it?

@rosstimothy

> Couldn't this be calculated from tracing data instead? Do we need a separate metric for it?

We could trace this; however, I don't think you would get the nice aggregated metrics or be able to monitor them via a Grafana dashboard as easily.

@rosstimothy rosstimothy left a comment

This seems a bit


// addConn updates the connection and dial type. It also reports the time
// it took to establish the connection.
func (c *metricConn) addConn(conn net.Conn, dt dialType) {

This function seems racy. I know the intended use case is to only call it once, perhaps we could enforce that better somehow? Perhaps a Dialer which returned you a net.Conn which was already wrapped by a metricConn?

I might be ok with this design - I like being explicit about when we start the timer, and our dialing isn't always easily expressed with a single dialer function - but we should enforce that a connection can only be added once.

Perhaps we could just take the time before we begin the dial procedures and have a constructor for the wrapper that takes the Conn, whatever metadata we have and the "time in which we started dialing"?
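The constructor idea suggested above might look roughly like this. It is only a sketch under stated assumptions, not the change made in the follow-up commit: `measuredConn`, `dialKind`, `newMeasuredConn`, and the `observe` callback are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialKind is a hypothetical label for how the connection was made,
// for example "direct", "peer", or "local tunnel".
type dialKind string

// measuredConn is a hypothetical wrapper. All of its fields are set once
// in the constructor, so there is no later "add connection" step to race on.
type measuredConn struct {
	net.Conn
	kind        dialKind
	established time.Time
}

// newMeasuredConn wraps an already-established connection. The caller
// passes the time at which dialing began, and the establishment latency
// is handed to observe (a stand-in for a histogram) exactly once, here.
func newMeasuredConn(conn net.Conn, kind dialKind, dialStart time.Time,
	observe func(dialKind, time.Duration)) *measuredConn {
	now := time.Now()
	observe(kind, now.Sub(dialStart))
	return &measuredConn{Conn: conn, kind: kind, established: now}
}

// dialAndWrap simulates a dial with an in-memory pipe and returns the
// observed kind and whether the observed latency was non-negative.
func dialAndWrap() (dialKind, bool) {
	var gotKind dialKind
	nonNegative := false
	observe := func(k dialKind, d time.Duration) {
		gotKind = k
		nonNegative = d >= 0
	}

	dialStart := time.Now() // taken before any dial procedure begins
	client, server := net.Pipe()
	defer server.Close()

	conn := newMeasuredConn(client, "direct", dialStart, observe)
	defer conn.Close()
	return gotKind, nonNegative
}

func main() {
	kind, ok := dialAndWrap()
	fmt.Println(kind, ok)
}
```

Because the wrapper only exists once a connection exists, the "add a connection only once" rule is enforced by construction rather than by convention.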

Contributor Author

Updated here 5e6a132

@marcoandredinis marcoandredinis left a comment

It seems we are calculating the total time from dial until the end of the first read.
I would expect this to be two metrics:

  • time to connect: checks for DNS and latency times (among other things)
  • time to first read: checks for server responsiveness

I'm saying this because they usually represent different infrastructure, and having this context should be helpful.

If we want to store the time from dial until the end of the first read, I would name the metric accordingly:
time_to_first_byte seems to be a fairly common expression for how long it takes from dial to the first read.
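To make the distinction between these measurements concrete, here is a small sketch of how the three durations relate. The `connTimings` type, its method names, and the timestamps in `exampleTimings` are made up for illustration; they are not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// connTimings holds three hypothetical timestamps in a connection's life.
type connTimings struct {
	dialStart   time.Time // dialing began
	established time.Time // connection returned to the caller
	firstRead   time.Time // first read returned
}

// timeToConnect covers dialing only (DNS, network latency, and so on).
func (t connTimings) timeToConnect() time.Duration {
	return t.established.Sub(t.dialStart)
}

// timeToFirstRead covers only the wait after the connection exists,
// which mostly reflects server responsiveness.
func (t connTimings) timeToFirstRead() time.Duration {
	return t.firstRead.Sub(t.established)
}

// timeToFirstByte spans both phases, from dial start to first read.
func (t connTimings) timeToFirstByte() time.Duration {
	return t.firstRead.Sub(t.dialStart)
}

// exampleTimings returns made-up timestamps: 30ms to connect and
// another 120ms until the first read returns.
func exampleTimings() connTimings {
	base := time.Unix(0, 0)
	return connTimings{
		dialStart:   base,
		established: base.Add(30 * time.Millisecond),
		firstRead:   base.Add(150 * time.Millisecond),
	}
}

func main() {
	tm := exampleTimings()
	fmt.Println("connect:", tm.timeToConnect())
	fmt.Println("first read:", tm.timeToFirstRead())
	fmt.Println("first byte:", tm.timeToFirstByte())
}
```

Recording the two phases separately keeps the total recoverable as their sum, while a single time_to_first_byte value cannot be split back into its parts.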

@dboslee dboslee enabled auto-merge (squash) July 6, 2022 19:58
@dboslee dboslee merged commit 0add854 into master Jul 6, 2022
github-actions bot commented Jul 6, 2022

@dboslee See the table below for backport results.

| Branch | Result |
|---|---|
| branch/v10 | Create PR |

dboslee added a commit that referenced this pull request Jul 7, 2022
@webvictim webvictim mentioned this pull request Jul 12, 2022
@zmb3 zmb3 deleted the david/conn-metrics branch September 9, 2022 18:58