feat(networkmonitor): add ping latencies, optimize reconnections #2068
Conversation
This PR may contain changes to the configuration options of one of the apps. If you are introducing a breaking change (i.e. the set of options in the latest release would no longer be applicable), make sure the original option is preserved with a deprecation note for the 2 following releases before it is actually removed. Please also make sure the label …
You can find the image built from this PR at …
Built from 48c47df
Thanks! Left a couple of comments.
Curious about how the histogram looks in the current network.
allPeers[peerId].connError = msg
return err("could not ping peer: " & msg)

let timedOut = not await ping().withTimeout(timeout)
Don't fully understand this proc ping(): Future[Result[void, string]] {.async, gcsafe.} = and this let timedOut = not await ping().withTimeout(timeout).
Isn't something like this better for handling the timeout?
if not (waitFor node.libp2pPing.ping(conn).withTimeout(timeout)):
  xxx
And if it errors before the timeout, that is handled in the except?
Maybe :D This was the first thing that came to mind when trying to overcome the weirdness of withTimeout, which would not return any result from ping.
I'll rewrite it.
Feels to me that this inner ping() function may be unnecessary and you can just dial and await the libp2pPing.ping() directly? May be missing something here.
Ok, so looking at the code again, my problem was that I need the Connection from dial() to be able to call the ping, but without withTimeout on the dial it will hang for some nodes. And if I add withTimeout, I don't see a way to extract the returned Connection.
This is why I wrapped it in a proc and just set the variables directly from inside the inner proc. If there is another (better) way to do it, I am happy to refactor, but I don't see one.
cc @jm-clius
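For context, a rough sketch of the pattern described above: the inner proc dials and pings, writing the measured delay into the enclosing scope, so a single withTimeout can bound the whole dial-plus-ping even though withTimeout only reports completion, not a result. This is only an illustration of the approach under discussion, not the PR's code; names such as WakuNode, RemotePeerInfo, measurePing and PingCodec are assumptions based on nwaku/nim-libp2p and the snippets in this thread.

import chronos, stew/results
import libp2p/protocols/ping        # Ping protocol, PingCodec

# Sketch only: WakuNode and RemotePeerInfo are assumed to come from the
# surrounding nwaku codebase.
proc measurePing(node: WakuNode, peerInfo: RemotePeerInfo,
                 timeout: Duration): Future[Result[Duration, string]] {.async.} =
  var pingDelay: Duration

  # Inner proc: dial, then run the libp2p ping protocol, storing the measured
  # delay in the enclosing scope so it survives the withTimeout() wrapper below.
  proc ping(): Future[Result[void, string]] {.async, gcsafe.} =
    try:
      let conn = await node.switch.dial(peerInfo.peerId, peerInfo.addrs, PingCodec)
      pingDelay = await node.libp2pPing.ping(conn)
      return ok()
    except CatchableError as e:
      return err(e.msg)

  # withTimeout() only yields a bool (completed vs. timed out), which is why
  # the measured delay has to be smuggled out through the closure above.
  if not await ping().withTimeout(timeout):
    return err("ping timed out")
  return ok(pingDelay)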
allPeers[peerId].avgPingDuration = pingDelay

# TODO: check why the calculation ends up losing precision
allPeers[peerId].avgPingDuration = int64((float64(allPeers[peerId].avgPingDuration.millis) * (AvgPingWindow - 1.0) + float64(pingDelay.millis)) / AvgPingWindow).millis
Is this formula correct? Think it's missing a /AvgPingWindow:
old average * (n-1)/n + new value / n
Something like this (unsure about the castings):
allPeers[peerId].avgPingDuration = int64((float64(allPeers[peerId].avgPingDuration.millis) * (AvgPingWindow - 1.0)/AvgPingWindow + float64(pingDelay.millis)) / AvgPingWindow).millis
No, that would not be correct IMO: you need the average to represent the 9 previous values in the window plus the new value; what you have there makes avg*0.9 + new value and then divides by 10 (check the parentheses).
I believe my formula is correct, just some weird issue with casting/rounding.
Misread the parentheses in your formula; yours and mine are the same.
As for "what you have there is make avg*0.9 + new value and then divide by 10": nope, since division has priority over addition,
old_average * (n-1)/n + new_value / n
is equivalent to
old_average * (n-1)/n + (new_value / n)
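For what it's worth, a quick standalone check (not from the PR) that the two spellings are the same update, taking AvgPingWindow = 10:

let
  n = 10.0           # AvgPingWindow
  oldAvgMs = 120.0   # hypothetical previous average, in milliseconds
  newMs = 80.0       # hypothetical new ping sample, in milliseconds

let a = (oldAvgMs * (n - 1.0) + newMs) / n      # form used in the PR code
let b = oldAvgMs * (n - 1.0) / n + newMs / n    # form written out in the review
doAssert abs(a - b) < 1e-9                      # both give 116.0 ms

As for the precision TODO above, one plausible contributor (a guess, not verified) is that the running average is truncated to whole milliseconds on every update via the int64 conversion and the .millis round-trip, so fractional changes are dropped each round.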
Approving so as not to be a bottleneck, but I made some comments below. The direction makes sense to me :)
# after connection, get supported protocols
let lp2pPeerStore = node.switch.peerStore
let nodeProtocols = lp2pPeerStore[ProtoBook][peerInfo.peerId]
allPeers[peerId].supportedProtocols = nodeProtocols
allPeers[peerId].lastTimeConnected = currentTime

# after connection, get user-agent
let nodeUserAgent = lp2pPeerStore[AgentBook][peerInfo.peerId]
allPeers[peerId].userAgent = nodeUserAgent

# store avaiable protocols in the network
for protocol in nodeProtocols:
  if not allProtocols.hasKey(protocol):
    allProtocols[protocol] = 0
  allProtocols[protocol] += 1

# store available user-agents in the network
if not allAgentStrings.hasKey(nodeUserAgent):
  allAgentStrings[nodeUserAgent] = 0
allAgentStrings[nodeUserAgent] += 1

debug "connected to peer", peer=allPeers[customPeerInfo.peerId]
Afaict this extra indentation will now result in these being only set if the ping condition is met. Perhaps intentional?
warn "error converting record to remote peer info", record=discNode.record | ||
continue | ||
# try to ping the peer | ||
if getTime().toUnix() >= allPeers[peerId].lastTimeConnected + ReconnectTime and allPeers[peerId].retries < MaxConnectionRetries: |
I'd suggest moving the logic under this out into a separate function; setConnectedPeersMetrics is becoming difficult to follow and maintain.
Something like (see also the sketch below):
if dueForPing(allPeers[peerId]):
  let pingDelay = ping(allPeers[peerId]).onError ...
May not be as simple as that, but I guess we should start separating concerns here as much as possible.
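For illustration only, the condition quoted above could be pulled into a small predicate along these lines; the peer type name is an assumption, and ReconnectTime / MaxConnectionRetries are the constants already used in the quoted code.

# Sketch: assumes CustomPeerInfo, ReconnectTime and MaxConnectionRetries are
# defined in the surrounding module, as in the snippets quoted in this review.
proc dueForPing(peer: CustomPeerInfo, now: int64): bool =
  now >= peer.lastTimeConnected + ReconnectTime and
    peer.retries < MaxConnectionRetries

# usage, mirroring the original condition:
#   if dueForPing(allPeers[peerId], getTime().toUnix()):
#     ...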
allPeers[peerId].lastTimeDiscovered = currentTime
allPeers[peerId].enr = discNode.record.toURI()
allPeers[peerId].enrCapabilities = discNode.record.getCapabilities().mapIt($it)
allPeers[peerId].discovered += 1
Perhaps time to assign allPeers[peerId] here to a variable? Just to improve readability, rather than repeating allPeers[peerId] throughout the rest of the proc.
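For example, a minimal version of that idea (a sketch reusing the identifiers from the quoted snippet, not code from the PR):

# Local alias: each use of `peer` expands back to the table lookup.
template peer: untyped = allPeers[peerId]

peer.lastTimeDiscovered = currentTime
peer.enr = discNode.record.toURI()
peer.enrCapabilities = discNode.record.getCapabilities().mapIt($it)
peer.discovered += 1

A template is used rather than a let/var binding because, if the table stores value objects rather than refs, a plain binding would copy the entry and the writes would not land in the table.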
@jm-clius I do agree with your comments and I am planning to address them, but I did not want to make this particular PR more complex than necessary. So I'd like to merge the addition of …
Description
This PR leverages the ping protocol to measure latency between the network monitor instance and discovered nodes. It records the latency as a …
It also adds a retry limit for failed connections, and adds a cliff to repeatedly discovered nodes so that we only reconnect to them after 60s.
Changes
- use int64 for timestamps instead of strings
- only reconnect to already-known peers once ReconnectTime has passed
- stop retrying failed connections after MaxConnectionRetries
- add a ping() proc to wrap dialling and executing the ping protocol
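For reference, the per-peer state implied by the snippets quoted in this conversation looks roughly like the following; the field names are taken from the quoted code, while the concrete types are an educated guess and may not match the actual CustomPeerInfo in the PR.

import chronos   # Duration

type
  CustomPeerInfo = object
    lastTimeConnected: int64        # Unix seconds, int64 instead of strings
    lastTimeDiscovered: int64
    retries: int                    # failed connection attempts so far
    connError: string               # last connection/ping error, if any
    avgPingDuration: Duration       # moving average over AvgPingWindow samples
    supportedProtocols: seq[string]
    userAgent: string
    enr: string                     # discNode.record.toURI()
    enrCapabilities: seq[string]
    discovered: int                 # number of times the peer was discovered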