feat(networkmonitor): add ping latencies, optimize reconnections #2068

vpavlin · 2023-09-21T20:06:22Z

Description

This PR uses the ping protocol to measure latency between the network monitor instance and discovered nodes.

It records the latency as:

- a histogram metric
- the last measured latency per node
- the average latency per node, with a sliding window of 10

It also adds a retry limit for failed connections and a cooldown for repeatedly discovered nodes, so the monitor only reconnects to them after 60s.

Changes

- use int64 for timestamps instead of strings
- add PingDuration (i.e. latency) fields
- add retries (how many times we failed to connect) and discovered (how many times we discovered the node) fields
- only dial and ping the node if ReconnectTime has passed since the last connection
- only dial and ping while retries is below MaxConnectionRetries
- use PeerId instead of public key
- add ping() proc to wrap dialling and executing the ping protocol
- set default log level to INFO

github-actions · 2023-09-21T20:06:38Z

This PR may contain changes to configuration options of one of the apps.

If you are introducing a breaking change (i.e. the set of options in latest release would no longer be applicable) make sure the original option is preserved with a deprecation note for 2 following releases before it is actually removed.

Please also make sure the label release-notes is added to make sure any changes to the user interface are properly announced in changelog and release notes.

github-actions · 2023-09-21T20:11:45Z

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2068

Built from 48c47df

alrevuelta

Thanks! Left a couple of comments.

Curious what the histogram looks like in the current network.

alrevuelta · 2023-09-22T07:11:46Z

apps/networkmonitor/networkmonitor.nim

+          allPeers[peerId].connError = msg
+          return err("could not ping peer: " & msg)
+
+      let timedOut = not await ping().withTimeout(timeout)


I don't fully understand proc ping(): Future[Result[void, string]] {.async, gcsafe.} =

and this: let timedOut = not await ping().withTimeout(timeout)

Isn't something like this a better way to handle the timeout?

if not (waitFor node.libp2pPing.ping(conn).withTimeout(timeout)): xxx

And if it errors before the timeout, is that handled in the except?

vpavlin · 2023-09-22

Maybe :D This was the first thing that came to mind when trying to work around the weirdness of withTimeout, which would not return any result from ping.

I'll rewrite it.

jm-clius · 2023-09-22

Feels to me that this inner ping() function may be unnecessary and you can just dial and await the libp2pPing.ping() directly? May be missing something here.

vpavlin · 2023-09-25

OK, so looking at the code again, my problem was that I need the Connection from dial() to be able to call the ping, but without withTimeout on the dial it will hang for some nodes. But if I add withTimeout, I don't see a way to extract the returned Connection.

This is why I wrapped it in a proc and just set the variables directly from inside the inner proc. If there is another (better) way to do it, I am happy to refactor, but I don't see one.

cc @jm-clius
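Editor's sketch of one way to get both a timeout and the returned value: keep a handle to the future, apply withTimeout to that handle, and read the result from it once withTimeout reports completion. To stay self-contained this uses Nim's std/asyncdispatch rather than the async library the PR is built on, and fakeDial and dialWithTimeout are hypothetical stand-ins for the real dial call. Whether the same read() call is available on the PR's future type would need checking.

```nim
import std/asyncdispatch

# Hypothetical stand-in for dialing a peer: resolves to a connection
# label after `delayMs` milliseconds.
proc fakeDial(delayMs: int): Future[string] {.async.} =
  await sleepAsync(delayMs)
  return "connection"

# Returns the dialed value, or an empty string if the dial timed out.
proc dialWithTimeout(delayMs, timeoutMs: int): Future[string] {.async.} =
  let dialFut = fakeDial(delayMs)            # keep a handle to the future
  if await dialFut.withTimeout(timeoutMs):   # true if it finished in time
    return dialFut.read()                    # safe: the future has completed
  return ""

echo waitFor dialWithTimeout(10, 200)  # dial finishes well before the timeout
echo waitFor dialWithTimeout(500, 50)  # dial is slower than the timeout
```

With this shape the ping could run on the returned value in the outer proc, so the inner wrapper would not have to write results into outer variables.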

alrevuelta · 2023-09-22T07:17:27Z

apps/networkmonitor/networkmonitor.nim

+        allPeers[peerId].avgPingDuration = pingDelay
+
+      # TODO: check why the calculation ends up losing precision
+      allPeers[peerId].avgPingDuration = int64((float64(allPeers[peerId].avgPingDuration.millis) * (AvgPingWindow - 1.0) + float64(pingDelay.millis)) / AvgPingWindow).millis


Is this formula correct?
I think it's missing a /AvgPingWindow:
old average * (n-1)/n + new value /n

Something like this (unsure about the casts):

allPeers[peerId].avgPingDuration = int64((float64(allPeers[peerId].avgPingDuration.millis) * (AvgPingWindow - 1.0)/AvgPingWindow + float64(pingDelay.millis)) / AvgPingWindow).millis

vpavlin · 2023-09-22

No, that would not be correct IMO. You need to make avg represent 9 values in the window + the new value; what you have there is avg*0.9 + new value and then divide by 10 (check the parentheses).

I believe my formula is correct, there is just some weird issue with casting/rounding.

alrevuelta · 2023-09-25

Misread the parentheses in your formula. Yours and mine are the same.

> what you have there is make avg*0.9 + new value and then divide by 10

Nope, since division takes precedence over addition,
old_average * (n-1)/n + new_value /n

is equivalent to

old_average * (n-1)/n + (new_value /n)
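Editor's sketch comparing the three expressions discussed in this thread. The proc names are hypothetical and AvgPingWindow is assumed to be 10.0, matching the window of 10 in the description. groupedForm follows the line in the diff, splitForm follows the formula written in prose, and doubleDivided follows the suggested code line read literally. updateMillis mirrors the int64(...) conversion in the diff; the truncation it shows is only one plausible contributor to the precision loss noted in the TODO, which the thread does not confirm.

```nim
const AvgPingWindow = 10.0  # assumed value, matching "sliding window of 10"

# Form in the diff: weight the old average by (n-1), add the sample, divide by n.
proc groupedForm(oldAvg, sample: float64): float64 =
  (oldAvg * (AvgPingWindow - 1.0) + sample) / AvgPingWindow

# Form written in prose: old_average * (n-1)/n + new_value/n.
proc splitForm(oldAvg, sample: float64): float64 =
  oldAvg * (AvgPingWindow - 1.0) / AvgPingWindow + sample / AvgPingWindow

# Suggested code line read literally: the old-average term is divided by n twice.
proc doubleDivided(oldAvg, sample: float64): float64 =
  (oldAvg * (AvgPingWindow - 1.0) / AvgPingWindow + sample) / AvgPingWindow

# Whole-millisecond update: converting float64 to int64 drops the fraction.
proc updateMillis(oldAvgMs, sampleMs: int64): int64 =
  int64(groupedForm(float64(oldAvgMs), float64(sampleMs)))

doAssert groupedForm(100.0, 200.0) == 110.0
doAssert splitForm(100.0, 200.0) == 110.0      # same value as the grouped form
doAssert doubleDivided(100.0, 200.0) == 29.0   # not the same update
doAssert updateMillis(100, 109) == 100         # 100.9 is truncated to 100
doAssert updateMillis(100, 110) == 101
```

Under these assumptions, a sample has to exceed the stored average by at least 10 ms before the whole-millisecond average moves up at all.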

jm-clius

Approving so as not to be a bottleneck, but made some comments below. Direction makes sense to me :)

jm-clius · 2023-09-22T11:03:42Z

apps/networkmonitor/networkmonitor.nim

+      # after connection, get supported protocols
+      let lp2pPeerStore = node.switch.peerStore
+      let nodeProtocols = lp2pPeerStore[ProtoBook][peerInfo.peerId]
+      allPeers[peerId].supportedProtocols = nodeProtocols
+      allPeers[peerId].lastTimeConnected = currentTime

-    # after connection, get user-agent
-    let nodeUserAgent = lp2pPeerStore[AgentBook][peer.get().peerId]
-    allPeers[peerId].userAgent = nodeUserAgent
+      # after connection, get user-agent
+      let nodeUserAgent = lp2pPeerStore[AgentBook][peerInfo.peerId]
+      allPeers[peerId].userAgent = nodeUserAgent

-    # store avaiable protocols in the network
-    for protocol in nodeProtocols:
-      if not allProtocols.hasKey(protocol):
-        allProtocols[protocol] = 0
-      allProtocols[protocol] += 1
+      # store avaiable protocols in the network
+      for protocol in nodeProtocols:
+        if not allProtocols.hasKey(protocol):
+          allProtocols[protocol] = 0
+        allProtocols[protocol] += 1

-    # store available user-agents in the network
-    if not allAgentStrings.hasKey(nodeUserAgent):
-      allAgentStrings[nodeUserAgent] = 0
-    allAgentStrings[nodeUserAgent] += 1
+      # store available user-agents in the network
+      if not allAgentStrings.hasKey(nodeUserAgent):
+        allAgentStrings[nodeUserAgent] = 0
+      allAgentStrings[nodeUserAgent] += 1

-    debug "connected to peer", peer=allPeers[customPeerInfo.peerId]
+      debug "connected to peer", peer=allPeers[customPeerInfo.peerId]


Afaict this extra indentation will now result in these being only set if the ping condition is met. Perhaps intentional?

jm-clius · 2023-09-22T11:05:02Z

apps/networkmonitor/networkmonitor.nim

+          allPeers[peerId].connError = msg
+          return err("could not ping peer: " & msg)
+
+      let timedOut = not await ping().withTimeout(timeout)


Feels to me that this inner ping() function may be unnecessary and you can just dial and await the libp2pPing.ping() directly? May be missing something here.

jm-clius · 2023-09-22T11:09:41Z

apps/networkmonitor/networkmonitor.nim

-      warn "error converting record to remote peer info", record=discNode.record
-      continue
+    # try to ping the peer
+    if getTime().toUnix() >= allPeers[peerId].lastTimeConnected + ReconnectTime and allPeers[peerId].retries < MaxConnectionRetries:


I'd suggest moving the logic under this out into a separate function. setConnectedPeersMetrics is becoming difficult to follow and maintain.
Something like:

if dueForPing(allPeers[PeerId]): let pingDelay = ping(allPeers[PeerId]).onError ...

May not be as simple as that, but I guess we should start separating concerns here as much as possible.
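Editor's sketch of what a dueForPing helper could look like, assuming the same condition as the line in the diff above. PeerStats is a hypothetical, reduced stand-in for the per-peer record; ReconnectTime reflects the 60s from the description, while the value of MaxConnectionRetries is made up for illustration. The current time is passed in as a parameter so the check can be exercised without a real clock.

```nim
const
  ReconnectTime: int64 = 60   # seconds, per the 60s in the PR description
  MaxConnectionRetries = 5    # hypothetical value, not taken from the PR

type PeerStats = object       # hypothetical, reduced per-peer record
  lastTimeConnected: int64    # Unix seconds
  retries: int

# True when the reconnect interval has passed and the retry budget is not used up.
proc dueForPing(peer: PeerStats, now: int64): bool =
  now >= peer.lastTimeConnected + ReconnectTime and
    peer.retries < MaxConnectionRetries

let fresh = PeerStats(lastTimeConnected: 1000, retries: 0)
doAssert not dueForPing(fresh, 1059)   # still inside the 60s interval
doAssert dueForPing(fresh, 1060)       # interval has elapsed
```

Taking the clock as an argument keeps the predicate pure, which is one way to start separating the decision from the dialing and metric updates.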

jm-clius · 2023-09-22T11:10:51Z

apps/networkmonitor/networkmonitor.nim


    allPeers[peerId].lastTimeDiscovered = currentTime
    allPeers[peerId].enr = discNode.record.toURI()
    allPeers[peerId].enrCapabilities = discNode.record.getCapabilities().mapIt($it)
+    allPeers[peerId].discovered += 1


Perhaps time to assign allPeers[peerId] here to a variable? Just to improve readability rather than repeated allPeers[peerId] throughout the rest of the proc.
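Editor's sketch of one thing to watch when following this suggestion. If the table's values are plain object types, assigning an entry to a local variable copies it, so later writes do not reach the table. The thread does not say whether the PR's record is an object or a ref object, so PeerStats below is a hypothetical reduced record and the ENR string is a placeholder. std/tables offers withValue to work on the stored entry in place.

```nim
import std/tables

type PeerStats = object   # hypothetical reduced record, a value type
  discovered: int
  enr: string

var allPeers = initTable[string, PeerStats]()
allPeers["peerA"] = PeerStats()

# A plain assignment copies an `object` value, so edits to the copy are lost.
var snapshot = allPeers["peerA"]
snapshot.discovered += 1
doAssert allPeers["peerA"].discovered == 0

# `withValue` exposes the stored entry itself; the body runs only if the key exists.
allPeers.withValue("peerA", peer):
  peer.discovered += 1
  peer.enr = "placeholder-enr"

doAssert allPeers["peerA"].discovered == 1
```

With a ref object value type, the plain assignment would alias the stored entry instead and the extra template would not be needed.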

vpavlin · 2023-09-25T09:32:13Z

@jm-clius I do agree with your comments and I am planning to address them, but I did not want to make this particular PR more complex than necessary. So I'd like to merge the addition of ping and then come up with a proper refactoring PR

feat(networkmonitor): add ping latencies, optimize reconnections

a6803e1

vpavlin requested a review from alrevuelta September 21, 2023 20:06

vpavlin requested review from hackyguru and jm-clius September 21, 2023 20:06

alrevuelta reviewed Sep 22, 2023


jm-clius approved these changes Sep 22, 2023


alrevuelta approved these changes Sep 25, 2023


vpavlin mentioned this pull request Sep 25, 2023

feat(wakucanary): add latency measurement using ping protocol #2074

Merged

3 tasks

vpavlin merged commit ed47354 into master Sep 25, 2023

vpavlin deleted the chore/network-monitor-retries branch September 25, 2023 12:39

vpavlin mentioned this pull request Sep 26, 2023

chore(networkmonitor): refactor setConnectedPeersMetrics, make it partially concurrent, add version #2080

Merged

4 tasks

s-tikhomirov pushed a commit that referenced this pull request Oct 6, 2023

feat(networkmonitor): add ping latencies, optimize reconnections (#2068)

3fb5a2b

s-tikhomirov pushed a commit that referenced this pull request Oct 6, 2023

feat(networkmonitor): add ping latencies, optimize reconnections (#2068)

cdfb72c
