relay service: add metrics #2154

Merged on Mar 7, 2023 (8 commits)

Conversation

sukunrt
Member

@sukunrt sukunrt commented Mar 1, 2023

Metrics Added:

ReservationRequest: Opened, Closed, Renewed
ReservationRequestResponseStatus
ReservationRejectionReason

ConnectionRequest: Opened, Closed
ConnectionRequestResponseStatus
ConnectionRejectionReason
ConnectionDuration

BytesTransferred

RelayStatus

@sukunrt sukunrt requested review from marten-seemann and removed request for marten-seemann March 1, 2023 13:29
@sukunrt sukunrt marked this pull request as ready for review March 2, 2023 06:14
p2p/protocol/circuitv2/relay/metrics.go (three resolved, outdated review threads)
Comment on lines 681 to 640
r.gc()
r.wg.Done()
Contributor

Was this a bug in our implementation? If so, would you mind opening a separate PR?

Member Author

Yes, it's a race condition. Will do. #2162

Member Author

This is the fix for the bug. I added the gc call there since it simplified counting closed connections.
https://github.com/libp2p/go-libp2p/pull/2164/files

		break
	}
	if r.metricsTracer != nil {
		r.metricsTracer.BytesTransferred(nr + nw)
Contributor

Aren't you counting every byte twice here?

Member Author

Okay, true. Bytes transferred should only be one of them. I added both because the bandwidth used would be incoming + outgoing, but that should be handled in the dashboard. Will fix.

ConnectionClosed(d time.Duration)
// ConnectionRequestHandled tracks metrics on handling a relay connection request
// rejectionReason is ignored for status other than `requestStatusRejected`
ConnectionRequestHandled(status string, rejectionReason string)
Contributor

Would it make sense to split this up into ConnectionRequestHandled (for success) and ConnectionRequestRejected? What would be the difference between ConnectionRequestReceived and ConnectionRequestHandled?

Member Author

Yeah, "received" was the same. I've removed ConnectionRequestReceived and ReservationRequestReceived and kept ReservationRequestHandled and ConnectionRequestHandled.

p := s.Conn().RemotePeer()
a := s.Conn().RemoteMultiaddr()

if isRelayAddr(a) {
	log.Debugf("refusing relay reservation for %s; reservation attempt over relay connection")
	r.handleError(s, pbv2.Status_PERMISSION_DENIED)
	r.handleErrorAndTrackMetrics(s, pbv2.HopMessage_RESERVE, pbv2.Status_PERMISSION_DENIED,
Contributor

I'm not a big fan of the handleError function. I know you didn't introduce it, so this is not your fault :)

What about returning an error and/or a status code, and handling it in the caller?

Member Author

Let me see how that looks.

Member Author

This looks much better! Thanks!
I've left handleError as it is. If you want to remove it, I can open a separate PR.

Contributor

Yeah, let's do this in a follow-up PR.

@marten-seemann marten-seemann linked an issue Mar 3, 2023 that may be closed by this pull request
@sukunrt sukunrt force-pushed the relay-svc-metrics branch 2 times, most recently from c3409a0 to 9121abd Compare March 3, 2023 06:15
Contributor

@marten-seemann marten-seemann left a comment

It's really cool to see how cheap it is to run the relay service. This was recorded over a 3h time frame:
image

dashboards/relaysvc/relaysvc.json
"refId": "C"
}
],
"title": "Connection Duration",
Contributor

This diagram seems to be missing the unit.

I'm also not sure how useful it is, the graph looks kind of wild on my instance:
image

Maybe it's easier if we just track the average:

rate(libp2p_relaysvc_connection_duration_seconds_sum[$__range])/rate(libp2p_relaysvc_connection_duration_seconds_count[$__range])

(Please double-check if $__range is the appropriate variable here)

image
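
To make the arithmetic behind that query concrete: dividing the rate of the histogram's `_sum` by the rate of its `_count` over the same window gives the mean duration of the observations made in that window. The sketch below shows only this ratio on two hypothetical snapshots. The `sample` type and the numbers are made up, and it ignores the counter-reset handling and extrapolation that PromQL's `rate` performs.

```go
package main

import "fmt"

// sample is a hypothetical snapshot of a histogram's _sum and _count series.
type sample struct {
	sum   float64 // total observed seconds so far
	count float64 // number of observations so far
}

// averageOver returns the mean duration of the observations recorded between
// two snapshots. The second result is false when nothing was observed.
func averageOver(start, end sample) (float64, bool) {
	dCount := end.count - start.count
	if dCount <= 0 {
		return 0, false
	}
	return (end.sum - start.sum) / dCount, true
}

func main() {
	start := sample{sum: 120, count: 10}
	end := sample{sum: 300, count: 40}
	if avg, ok := averageOver(start, end); ok {
		fmt.Printf("average connection duration: %.1fs\n", avg)
	}
}
```

Here 30 connections closed in the window and accounted for 180 seconds in total, so the average is 6 seconds.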

Member Author

@sukunrt sukunrt Mar 7, 2023

I'm keeping $__range because I think it is the most informative. Using rate_interval gives the average at each point in time, but that graph is very spiky. It is also an average over all connections in that period, so a spike doesn't necessarily mean something is wrong.
On the other hand, with $__range, increasing the dashboard time range changes this graph, which seems wrong.
[Screenshot 2023-03-07 at 12 32 24 PM]

Member Author

Changed the label to "rolling average".

Development

Successfully merging this pull request may close these issues.

relay service: expose metrics
2 participants