Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(networkmonitor): add ping latencies, optimize reconnections #2068

Merged
merged 1 commit into from
Sep 25, 2023

Conversation

vpavlin
Copy link
Member

@vpavlin vpavlin commented Sep 21, 2023

Description

This PR leverages ping protocol to measure latency in between the network monitor instance and discovered nodes.

It records the latency as a

  • histogram metric
  • last measured latency per node
  • average latency per node with sliding window of 10

It also adds retry limit for failed connections and adds a cliff to repeatedly discovered nodes to only reconnect to them after 60s

Changes

  • use int64 for timestamps instead of strings
  • add PingDuration (i.e. latency) fields
  • add retries (how many times we failed to connect) and discovered (how many times we discovered the node) field
  • only dial and ping the node if ReconnectTime passed
  • only dial and ping unless we are over the MaxConnectionRetries
  • use PeerId instead of public key
  • add ping() proc to wrap dialling and executing ping protocol
  • set default log level to INFO

@github-actions
Copy link

This PR may contain changes to configuration options of one of the apps.

If you are introducing a breaking change (i.e. the set of options in latest release would no longer be applicable) make sure the original option is preserved with a deprecation note for 2 following releases before it is actually removed.

Please also make sure the label release-notes is added to make sure any changes to the user interface are properly announced in changelog and release notes.

@github-actions
Copy link

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2068

Built from 48c47df

Copy link
Contributor

@alrevuelta alrevuelta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! left couple of comments.

curious about the how the histogram looks like in the current network.

allPeers[peerId].connError = msg
return err("could not ping peer: " & msg)

let timedOut = not await ping().withTimeout(timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont fully understand proc ping(): Future[Result[void, string]] {.async, gcsafe.} =

and this let timedOut = not await ping().withTimeout(timeout)

isnt something like this better to handle the timeout?

if not (waitFornode.libp2pPing.ping(conn)).withTimeout(timeout)):
   xxx

and if it errors before timeout, that is handled in the except?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:D This was the first thing that came to mind when trying to overcome the weirdness of withTimeout which would not return any result from ping.

I'll rewrite it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels to me that this inner ping() function may be unnecessary and you can just dial and await the libp2pPing.ping() directly? May be missing something here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so looking at the code again, my problem was that I need the Connection from dial() to be able to call the ping, but without withTimeout on the dial it will hang for some nodes. But then if I add withTimeout I don't see a way to extract the returned Connection.

This is why I wrapped it in a proc and just set the variables directly from inside of the inner proc. If there is another (better) way to do it, I am happy to refactor, but I don't see

cc @jm-clius

allPeers[peerId].avgPingDuration = pingDelay

# TODO: check why the calculation ends up losing precision
allPeers[peerId].avgPingDuration = int64((float64(allPeers[peerId].avgPingDuration.millis) * (AvgPingWindow - 1.0) + float64(pingDelay.millis)) / AvgPingWindow).millis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this formula correct?
think its missing a /AvgPingWindow
old average * (n-1)/n + new value /n

something like this (unsure about the castings)

allPeers[peerId].avgPingDuration = int64((float64(allPeers[peerId].avgPingDuration.millis) * (AvgPingWindow - 1.0)/AvgPingWindow + float64(pingDelay.millis)) / AvgPingWindow).millis

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that would not be correct IMO - you need to make avg represent 9 values in the window + new value, what you have there is make avg*0.9 + new value and then divide by 10 (check the parentheses)

I believe my formula is correct, just some weird issue with casting/rounding

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missread the parenthesis in your formula. yours and mine are the same.

what you have there is make avg*0.9 + new value and then divide by 10

nope, since division has priority over sum,
old_average * (n-1)/n + new_value /n

is equivalent to

old_average * (n-1)/n + (new_value /n)

Copy link
Contributor

@jm-clius jm-clius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as to not be a bottleneck, but made some comments below. Direction makes sense to me :)

Comment on lines +140 to +161
# after connection, get supported protocols
let lp2pPeerStore = node.switch.peerStore
let nodeProtocols = lp2pPeerStore[ProtoBook][peerInfo.peerId]
allPeers[peerId].supportedProtocols = nodeProtocols
allPeers[peerId].lastTimeConnected = currentTime

# after connection, get user-agent
let nodeUserAgent = lp2pPeerStore[AgentBook][peer.get().peerId]
allPeers[peerId].userAgent = nodeUserAgent
# after connection, get user-agent
let nodeUserAgent = lp2pPeerStore[AgentBook][peerInfo.peerId]
allPeers[peerId].userAgent = nodeUserAgent

# store avaiable protocols in the network
for protocol in nodeProtocols:
if not allProtocols.hasKey(protocol):
allProtocols[protocol] = 0
allProtocols[protocol] += 1
# store avaiable protocols in the network
for protocol in nodeProtocols:
if not allProtocols.hasKey(protocol):
allProtocols[protocol] = 0
allProtocols[protocol] += 1

# store available user-agents in the network
if not allAgentStrings.hasKey(nodeUserAgent):
allAgentStrings[nodeUserAgent] = 0
allAgentStrings[nodeUserAgent] += 1
# store available user-agents in the network
if not allAgentStrings.hasKey(nodeUserAgent):
allAgentStrings[nodeUserAgent] = 0
allAgentStrings[nodeUserAgent] += 1

debug "connected to peer", peer=allPeers[customPeerInfo.peerId]
debug "connected to peer", peer=allPeers[customPeerInfo.peerId]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaict this extra indentation will now result in these being only set if the ping condition is met. Perhaps intentional?

allPeers[peerId].connError = msg
return err("could not ping peer: " & msg)

let timedOut = not await ping().withTimeout(timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels to me that this inner ping() function may be unnecessary and you can just dial and await the libp2pPing.ping() directly? May be missing something here.

warn "error converting record to remote peer info", record=discNode.record
continue
# try to ping the peer
if getTime().toUnix() >= allPeers[peerId].lastTimeConnected + ReconnectTime and allPeers[peerId].retries < MaxConnectionRetries:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest moving the logic under this out into a separate function. setConnectedPeersMetrics is becoming diffult to follow and maintain.
Something like:

if dueForPing(allPeers[PeerId]):
  let pingDelay = ping(allPeers[PeerId]).onError ...

My not be as simple as that, but I guess we should start separating concerns here as much as possible.


allPeers[peerId].lastTimeDiscovered = currentTime
allPeers[peerId].enr = discNode.record.toURI()
allPeers[peerId].enrCapabilities = discNode.record.getCapabilities().mapIt($it)
allPeers[peerId].discovered += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps time to assign allPeers[peerId] here to a variable? Just to improve readability rather than repeated allPeers[peerId] throughout the rest of the proc.

@vpavlin
Copy link
Member Author

vpavlin commented Sep 25, 2023

@jm-clius I do agree with your comments and I am planning to address them, but I did not want to make this particular PR more complex than necessary. So I'd like to merge the addition of ping and then come up with a proper refactoring PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants