Add metrics to reversetunnel connections #14027
These metrics will help to see the difference in latency between connections over local tunnels, peer proxies, and direct dials. Co-authored-by: Vitor Enes <vitor.duarte@goteleport.com>
9037af4 to 3c7dd91
// start is the time since the last state was reported.
start time.Time
firstUse sync.Once
Nit: we could maybe rename all instances of "first use" to "first read", or something else that better represents a connection state (the histogram label) like "established".
+1 for first read.
You can also change to another variation and add the Once suffix, which seems common in the codebase.
Couldn't this be calculated from tracing data instead? Do we need a separate metric for it?
We could trace this; however, I don't think you would get the nice aggregated metrics or be able to monitor it via a Grafana dashboard as easily.
This seems a bit
lib/reversetunnel/conn_metric.go
Outdated
// addConn updates the connection and dial type. It also reports the time
// it took to establish the connection.
func (c *metricConn) addConn(conn net.Conn, dt dialType) {
This function seems racy. I know the intended use case is to only call it once; perhaps we could enforce that better somehow? Perhaps a Dialer which returned you a net.Conn which was already wrapped by a metricConn?
I might be ok with this design - I like being explicit with when we start the timer, and our dialing isn't always easily expressed with a single dialer function - but we should enforce that we can only add a connection once.
Perhaps we could just take the time before we begin the dial procedures and have a constructor for the wrapper that takes the Conn, whatever metadata we have and the "time in which we started dialing"?
Updated here 5e6a132
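For readers following this thread, here is a minimal sketch of the constructor-based approach suggested above. It is not the code from the PR: newMetricConn, observeFunc and demoMetricConn are hypothetical names, and the observe callback stands in for whatever histogram the real implementation uses. The connection is set exactly once at construction, so there is no separate addConn call to race on, and a sync.Once guards the first-read report.

```go
package main

import (
	"net"
	"sync"
	"time"
)

// dialType labels how the connection was made (hypothetical values).
type dialType string

// observeFunc is a hypothetical stand-in for a histogram observation.
type observeFunc func(dt dialType, state string, d time.Duration)

// metricConn wraps a net.Conn and reports the time to first read once.
type metricConn struct {
	net.Conn
	dialType      dialType
	start         time.Time // when the previous state was reported
	firstReadOnce sync.Once
	observe       observeFunc
}

// newMetricConn (hypothetical) takes an already-dialed conn plus the time
// at which dialing started, and reports the "established" duration.
func newMetricConn(conn net.Conn, dt dialType, dialStart time.Time, observe observeFunc) *metricConn {
	c := &metricConn{Conn: conn, dialType: dt, start: time.Now(), observe: observe}
	observe(dt, "established", c.start.Sub(dialStart))
	return c
}

// Read reports the time from establishment to the end of the first read.
// Later reads do not report again.
func (c *metricConn) Read(b []byte) (int, error) {
	n, err := c.Conn.Read(b)
	c.firstReadOnce.Do(func() {
		c.observe(c.dialType, "first_read", time.Since(c.start))
	})
	return n, err
}

// demoMetricConn exercises the wrapper over an in-memory pipe and returns
// the states that were reported, in order.
func demoMetricConn() []string {
	var states []string
	observe := func(_ dialType, state string, _ time.Duration) {
		states = append(states, state)
	}
	dialStart := time.Now()
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	c := newMetricConn(client, "direct", dialStart, observe)
	go func() {
		server.Write([]byte("a"))
		server.Write([]byte("b"))
	}()
	buf := make([]byte, 1)
	c.Read(buf)
	c.Read(buf) // second read must not report again
	return states
}
```

Because the wrapper only exists once a connection is available, "add a connection twice" is not expressible, which is the property asked for in the review.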
It seems we are calculating the total time from dial until the end of the first read.
I would expect this to be two metrics:
- time to connect: checks for DNS and latency times (among other things)
- time to first read: checks for server responsiveness
I'm saying this because they usually represent different infrastructure, and having this context should be helpful.
If we want to store the time from dial until the end of the first read, I would rename the metric: time_to_first_byte seems to be a somewhat common expression to denote how long it takes from dial to the first read.
These metrics will help to see the difference in latency between
connections over local tunnels, peer proxies, and direct dials.
The time to establish a connection is measured from the top of the function until the connection is returned.
The time to first use a connection is measured from when the connection is established to when the first read returns.
Implements part of https://github.com/gravitational/cloud/issues/1823