
restarting a peer can lead to permanent global non-convergence and connection churn #1554

Closed

rade opened this issue Oct 17, 2015 · 17 comments

rade commented Oct 17, 2015

On a number of occasions we have seen log files that show connections shutting down with "Multiple peers found with same name: ..." and "Connection appears to be with different version of a peer we already know of", accompanied by significant levels of connection churn.

To date we have considered these to be transitory conditions, caused by delays in peers detecting that one of their connections is stale and hence still believing they are connected to the old incarnation of a restarted peer. Heartbeat timeouts should resolve that in less than ~1 minute.

However, users report that they have seen these errors persist for far longer than 1 minute.

@rade rade added the bug label Oct 17, 2015
rade commented Oct 17, 2015

I have a somewhat vague theory...

For a restarted peer to be permanently "accepted" into a network, all peers in that network must no longer hold a reference to the old incarnation. That requires a) all peers that were connected to that peer to detect the loss of connectivity to it, and b) that information to propagate to all other peers. Furthermore, all this needs to happen before any peer accepts the new incarnation of the peer. Otherwise, in effect, the "old incarnation of peer has gone away" and "here is a (new incarnation of a) peer" facts race around the network in the topology gossip, and whenever they collide they end up dropping the gossiping connection.

I haven't totally convinced myself that convergence is impossible in some such circumstances, but it's certainly non-trivial to prove that it is.

rade commented Oct 17, 2015

Here is a proposed fix:

  • we get rid of all UID checks. In particular, a peer will no longer drop the connection upon discovering that
    • the remote peer of a new connection has a different UID from the one it currently knows, or
    • received topology gossip refers to a peer with a UID different from the one it currently knows.
  • when updating peer information from topology gossip, we set the UID to that which was received.
  • when topology gossip contains information about ourself, if the received version number is greater than ours, we set ours to that value + 1.

The underlying idea is that knowledge in the network of an old incarnation of a peer will be replaced by information about the new incarnation, without requiring an intermediate step in which peers are able to 'gc' the old incarnation.

rade commented Oct 17, 2015

Hmm. I think this means we can get rid of the UID completely.

rade commented Oct 18, 2015

  • when topology gossip contains information about ourself, if the received version number is greater than ours, we set ours to that value + 1

There is an edge case here that prevents convergence in the absence of change: if the network has knowledge of an old incarnation at version V, and the new incarnation is also at version V, then the latter information will not supplant the former. That only happens when there is a further change to the peer, bumping the version to V+1.

I can think of two ways of dealing with that:

  • when topology gossip contains information about ourself, if the received version number is greater than ours, we set ours to that value + 1; if it is equal to ours and the received UID differs from ours, we increment our version
  • when topology gossip contains information about ourself, if the received version number is greater than ours, we set ours to that value + 1; if it is equal to ours, we compare the contents (i.e. the list of connections and the short id) and, if they differ, increment our version number

The first prevents us from getting rid of UIDs. The second is potentially expensive, especially considering that we will nearly always hit the comparison case while the local peer is not changing. We could eliminate the cost by adding a hash of the content, but arguably that is no better than just keeping UIDs.

And I reckon we also need to introduce the following tie-breaker:

  • when receiving information from topology gossip about a peer, if the versions are the same, we update our information if the received UID is greater than ours

That way there is convergence in the "same version but different UID" case, albeit possibly to the "wrong" state, i.e. using information from an old incarnation, but this will only be a temporary aberration.

rade commented Oct 18, 2015

Backward compatibility is an issue here, even if we keep UIDs. The problem is that, as explained, current routers very much depend on the entire network forgetting about a peer before accepting a new incarnation of it. And the disconnects on UID mismatch rather help in creating such a situation. New routers wouldn't do that.

One possibility here is to keep the disconnecting behaviour for direct connections, i.e. the Connection appears to be with different version of a peer we already know of case. This would substantially increase the chances that the old incarnation will be "flushed" from the topology across the network. But it doesn't guarantee that. Indeed, this whole issue exists precisely because we cannot guarantee that, even with all the existing disconnection behaviour.

We could alternatively/additionally advise users to "wait a while" when restarting a peer.

bboreham commented

I cannot manage to come up with a scenario where the old peer is not forgotten.

For example, suppose we have peers A, B, C, D, E, connected in a ring, i.e. A<->B<->C<->D<->E and A<->E. At the start of this analysis, all peers know about all other peers.

  • C is restarted, with same peer name and different UID. Call this C2.
  • B passes gossip to A containing info on C2 and B<->C2 connection.
  • connection from A to B is instantly dropped as C2 clashes with C.

A does not garbage-collect C as it still sees a connection via E.

  • D passes gossip to E containing info on C2.
  • connection from D to E is dropped as C2 clashes with C.

E does not garbage-collect C as it still sees a connection via A.

However, A will now gossip to E to say that its connection to B is gone, and once that happens, E will garbage-collect C and accept the new C2.

Now A will drop the connection to E because E's knowledge of C2 clashes, and so A has no connections to anywhere and will garbage-collect everything.

We see peer version numbers in the millions in logs, so connections were being added and removed very frequently. #1496 should slow this down, and perhaps allow us to understand the real problem underlying this report.

rade commented Oct 19, 2015

A will now gossip to E to say that its connection to B is gone

Not if it has since been re-established.

#1496 should slow this down, and perhaps allow us to understand the real problem underlying this report.

#1496 will help. It may even make the problem go away. However, the scheme we have for dealing with restarting peers is not obviously convergent, since it in effect relies on the transmission of events rather than states. And it couples the topology gossip to the connection lifecycle in very intricate ways: closing connections on certain error conditions is required for convergence.

My proposed fix addresses that head-on.

bboreham commented

A will now gossip to E to say that its connection to B is gone

Not if it has since been re-established.

How can this happen? As long as B is telling A about C2, A will disconnect from B.

rade commented Oct 19, 2015

How can this happen? As long as B is telling A about C2, A will disconnect from B.

The gossip is received some time after connection establishment.

bboreham commented

OK, but some time after that A will tell E that A->B is gone, or A will try to tell E about C2 and E will drop E->A. Either way, E will lose all connectivity to C and garbage-collect it.

rade commented Oct 19, 2015

but some time after that A will tell E that A->B is gone

Not if the connection has been re-established. If two peers continuously attempt to connect to each other and succeed, only for the connection to be dropped due to some gossip, it is possible that the "disconnected" state is never propagated to other peers.

rade commented Oct 19, 2015

My broader point though is that gossip convergence should not depend on careful orchestration of the underlay connectivity. That's just asking for trouble.

bboreham commented

OK, looking again at the logs I see that it is typical for the connection to last 10-40ms before being torn down, while the delay before reconnection is perhaps 2-10ms. With many peers, the chance that at least one of them claims to have a path to the old peer is increased.

FWIW, I couldn't get this from the description:

Otherwise, in effect, the "old incarnation of peer has gone away" and "here is a (new incarnation of a) peer" facts race around the network in the topology gossip

What about:

Otherwise, in effect, the "this connection to an old peer has gone away" and "I have a new connection which I still think leads to the old peer" facts race around the network in the topology gossip

?

And

all this needs to happen before any peer accepts the new incarnation of a peer

is disproved by my first attempt at a counter-example.

rade commented Oct 19, 2015

Otherwise, in effect, the "this connection to an old peer has gone away" and "I have a new connection which I still think leads to the old peer" facts race around the network in the topology gossip

The two facts I mentioned are the new facts. They both race around the network, together with gossip about the old state, i.e. the fact you mention.

is disproved by my first attempt at a counter-example

It isn't.

rade commented Oct 21, 2015

re backward compatibility

One possibility here is to keep the disconnecting behaviour for direct connections, i.e. the Connection appears to be with different version of a peer we already know of case. This would increase the chances substantially that the old incarnation will be "flushed" from the topology across the network.

@bboreham and I are not convinced this really helps, and it delays acceptance of new peers.

There is an interesting question here, though: what should we do when accepting such a connection? In particular, should we update the UID? That would be oddly non-monotonic. We could have existing connections to the old incarnation that we haven't noticed dropping yet, the peer might be bouncing rapidly, etc. If we update the UID we hold for the peer, it would be down to luck whether we end up with the most recent one. Also, the version would be bogus.

So instead I propose we simply leave the UID alone and let gossip update it eventually.

@rade rade self-assigned this Oct 22, 2015
@rade rade added this to the 1.3.0 milestone Oct 29, 2015
rade commented Nov 5, 2015

I have finally managed to reproduce this. Sort of.

First apply the following patch...

diff --git a/router/gossip_channel.go b/router/gossip_channel.go
index 11e8e8a..7bb72d2 100644
--- a/router/gossip_channel.go
+++ b/router/gossip_channel.go
@@ -4,7 +4,10 @@ import (
        "bytes"
        "encoding/gob"
        "fmt"
+       "strings"
        "sync"
+
+       "github.com/weaveworks/weave/common/debug"
 )

 type GossipChannel struct {
@@ -156,6 +159,10 @@ func (c *GossipChannel) sendDown(conn Connection, data GossipData) {
        sender, found := c.senders[conn]
        if !found {
                sender = NewGossipSender(func(pending GossipData) {
+                       if c.name == "topology" &&
+                               (strings.Contains(c.ourself.NickName, "host1") || strings.Contains(conn.Remote().NickName, "host1")) {
+                               debug.Break("[gossip %s]: sending to %s", c.name, conn.Remote())
+                       }
                        for _, msg := range pending.Encode() {
                                if len(msg) > maxFeasibleMessageLen {
                                        panic(fmt.Sprintf("Gossip message too large: len=%d bytes; on channel '%s' from %+v", len(msg), c.name, pending))
@@ -231,6 +238,10 @@ func (c *GossipChannel) sendBroadcast(srcName PeerName, update GossipData) {
                protocolMsg := ProtocolMsg{ProtocolGossipBroadcast, GobEncode(c.name, srcName, msg)}
                // FIXME a single blocked connection can stall us
                for _, conn := range connections {
+                       if c.name == "topology" && strings.Contains(conn.Remote().NickName, "host1") {
+                               c.log("DROPPING broadcast to", conn.Remote().String(), "from", srcName)
+                               continue
+                       }
                        conn.(ProtocolSender).SendProtocolMsg(protocolMsg)
                }
        }

This also requires weave/common/debug/debug.go:

package debug

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

// breakpoint records a suspended call to Break, together with the
// channel used to resume it.
type breakpoint struct {
    format string
    args   []interface{}
    resume chan<- struct{}
}

type debugger struct {
    enabled     bool
    breakpoints []breakpoint
    suspend     chan<- breakpoint
}

var d debugger

func init() {
    suspend := make(chan breakpoint)
    d = debugger{false, nil, suspend}
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGUSR1, syscall.SIGUSR2)
    go func() {
        for {
            select {
            case sig := <-sigs:
                switch sig {
                case syscall.SIGUSR1:
                    // Toggle breakpoints; disabling also releases any
                    // currently suspended callers.
                    d.enabled = !d.enabled
                    if !d.enabled {
                        fmt.Fprintln(os.Stderr, "DEBUG: breakpoints disabled")
                        resume()
                    } else {
                        fmt.Fprintln(os.Stderr, "DEBUG: breakpoints enabled")
                    }
                case syscall.SIGUSR2:
                    // Release suspended callers but leave breakpoints
                    // enabled, so subsequent calls to Break suspend again.
                    resume()
                }
            case bp := <-suspend:
                if d.enabled {
                    fmt.Fprintf(os.Stderr, "BREAK "+bp.format+"\n", bp.args...)
                    d.breakpoints = append(d.breakpoints, bp)
                } else {
                    bp.resume <- struct{}{}
                }
            }
        }
    }()
}

// resume releases all currently suspended callers of Break.
func resume() {
    for _, bp := range d.breakpoints {
        fmt.Fprintf(os.Stderr, "RESUME "+bp.format+"\n", bp.args...)
        bp.resume <- struct{}{}
    }
    d.breakpoints = nil
}

// Break blocks the calling goroutine until resumed via SIGUSR1/SIGUSR2;
// it returns immediately while breakpoints are disabled (the default).
func Break(format string, args ...interface{}) {
    resume := make(chan struct{})
    d.suspend <- breakpoint{format, args, resume}
    <-resume
}

Then establish the following topology using your desktop machine and our three Vagrant VMs:

        host2
       /  |  \
desktop   |   host1
       \  |  /
        host3

I do this by

  1. host1: weave launch --no-discovery
  2. host2: weave launch --no-discovery 192.168.48.11
  3. host3: weave launch --no-discovery 192.168.48.11 192.168.48.12
  4. desktop: weave launch --no-discovery 192.168.48.12 192.168.48.13

The patch above

  • drops topology broadcast to host1
  • allows us to pause the topology gossip to and from host1

Once everything has settled (check with weave status on each node to see 4 peers with 10 established connections)...

  1. host1: WPID=$(docker inspect -f '{{.State.Pid}}' weave); sudo kill -USR1 $WPID
  2. host2: WPID=$(docker inspect -f '{{.State.Pid}}' weave); sudo kill -USR1 $WPID
  3. host3: WPID=$(docker inspect -f '{{.State.Pid}}' weave); sudo kill -USR1 $WPID

This pauses topology gossip between host2/3 and host1.

Next we wait for the "breakpoints" to hit - there will be messages like

BREAK [gossip topology]: sending to be:84:4f:13:85:c0(host1)

in the host1/2/3 weave docker container logs.

Once that has happened we know that any subsequent gossip will be merged into the existing gossip data held by the GossipSenders.

We now restart weave on our desktop:

  1. desktop: weave stop
  2. desktop: weave launch --no-discovery 192.168.48.12 192.168.48.13

Wait for a while for things to settle. Then re-enable gossip from host2 to host1:

  1. host2: sudo kill -USR1 $WPID

Watch the host1 and host2 logs. Eventually - it takes a while because of the continued dropping of topology gossip from host1 and topology broadcast to host1 - we see the connection between the two getting dropped with something like

ERRO: 2015/11/05 11:10:03.537339 ->[192.168.48.12:37340|1e:b9:3c:eb:14:b7(host2)]: connection shutting down due to error: Multiple peers found with same name: 0a:f7:0d:d0:03:df

Eventually host2 will reconnect to host1, and, after a while, the same error occurs. And so on, indefinitely.

rade commented Nov 5, 2015

The test demonstrates that some extreme scheduling can result in indefinite non-convergence. Essentially host1 never forgets about the old incarnation of the desktop peer because it thinks it has a route to it via host3. The situation resolves itself when we re-enable the gossip from host3 to host1. That too will result in the connection getting dropped with the above error. But crucially at that point host1 no longer has any connection to either host2 or host3, hence no route to the desktop peer, and hence it garbage-collects it. Next time host2 or host3 connect to it, no conflict is detected.

Note, though, that because the conflict is detected some time after the connection is established - on receipt of some gossip - it is possible, I think, for the connections from host2 and host3 to host1 to always overlap, i.e. for there always to be a connection to at least one of them, which in turn prevents host1 from garbage-collecting the desktop peer. This situation would have been much more likely pre-1.2.0, due to the lack of backoff on late connection failures.

The test also shows that mere network disruption can cause prolonged, but temporary, non-convergence. The dropping and pausing of topology gossip/broadcast that we introduce here could equally arise from indefinite queueing due to TCP buffers filling up. However, any such disruption would cause the connections to be terminated after at most heartbeat-interval * 2, and this in turn would cause host1 to garbage-collect the stale information about the desktop peer.

rade added a commit that referenced this issue Nov 5, 2015
Previously when a peer restarted, information about the new
incarnation (i.e. with a different UID) was not accepted by other
peers (and connections would be dropped) unless all knowledge of the
previous incarnation had been purged. This could result in a lot of
connection churn and hence connectivity disruption, and, in some
pathological cases, very slow convergence and hence acceptance of the
new incarnation into the network.

We now no longer drop connections when encountering different
incarnations of a peer. There are two situations when that can happen:

1) on connection establishment

we simply proceed

2) on receipt of gossip

to ensure convergence we

a) treat the UID as an additional discriminator when deciding whether
we should update our information about a peer with that which was
gossiped. Specifically, we update the information we hold when a) the
gossiped version is greater, or b) is the same and the UID is
greater.

b) include the UID in the information we update

c) move our own version number beyond any we receive for ourselves, if
the received UID differs from ours.

With (a) we establish a total order of peer information across
several incarnations of the same peer, i.e. we consider information to
be fresher if it has a higher version, or the same version and a
higher UID. This may seem somewhat counterintuitive, since it will
generally treat information about new incarnations as older than that
of old incarnations, given that incarnations always start life with
version 1. But to do better we'd need to establish a total order of
incarnations that matches their temporal occurrence, which requires
some sort of durable state.

So instead we have (c). Through that we learn the highest version
number of any old incarnation of ourselves that other peers still
hold, and then make sure that our version is greater than
that. Essentially we continue where the old incarnations left
off. It's as if instead of restarting we had simply changed UIDs. And
due to (a) and (b) the information about the new incarnation of
ourselves, now with a higher version, will supersede that of the old
incarnations.

Fixes #1554.
@awh awh removed this from the 1.3.0 milestone Nov 9, 2015
rade added a commit that referenced this issue Nov 9, 2015
rade added a commit that referenced this issue Nov 9, 2015
bboreham added a commit that referenced this issue Nov 9, 2015
…ange

better convergence when peers restart; fixes #1554.
@rade rade modified the milestone: 1.3.0 Nov 11, 2015
@awh awh removed this from the 1.4.0 milestone Nov 12, 2015
@awh awh modified the milestones: 1.3.0, 1.4.0 Nov 12, 2015