
Flapping instances in EC2 #1335

Closed
wolftrouble opened this issue Oct 25, 2015 · 5 comments

Comments

@wolftrouble

Hi there, I've seen a few threads around this but nothing that addresses our situation.

In our case we have a 3-server cluster running in EC2 classic, with about 5 clients so far (this is all a POC). The clients are all low-utilization pre-prod boxes, so no CPU exhaustion or memory pressure problems. The servers are all dedicated m3.medium instances. There is no containerization involved.

The behavior we're seeing is what other people have reported: clients dropping and re-joining constantly, typically every few seconds, with no indication the clients are suffering or having problems. Servers do NOT fail and re-join - only clients are affected.

All security groups appear to be correct, as evidenced by plenty of UDP traffic back and forth on the serf ports (specifically 8301).
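
(For reference, the sort of check I'm basing that on - a minimal sketch, where eth0 is an assumption and should be swapped for the instance's actual interface:)

# Watch Serf LAN gossip (TCP and UDP) to and from this node
sudo tcpdump -nn -i eth0 port 8301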

I spent today doing some tcpdumping to try to find anything interesting, and about all I can find is that nothing obviously interesting is going on - I can't find evidence of dropped packets, and parsing the sequence numbers turns up no missed acks that I can find. Also interesting: I'm seeing nodes that have supposedly been dropped from the cluster (going by one of the servers' logs) still being used for indirect probes even while showing as a failed member - although again, this is from the view of one server, and I suspect that if I time-correlated checks on other servers, I'd find not all of them believe my client under test is dead.

I'm really at a loss to explain what's happening. All the suggestions I can find for people who've had this problem relate either to Docker or to an improper SG setup (specifically, not allowing UDP through), but I see plenty of (expected) UDP traffic between clients and servers. We have other clusters not running in EC2 that don't demonstrate this problem. I'm running out of things to try - any suggestions?

@djenriquez

Hi @wolftrouble, have you seen issue #1212?

Docker has a known issue with its connection-tracking (conntrack) table when agents stop and start quickly - stale conntrack entries cause UDP packets to be dropped.

If you do recall stopping and starting the Consul containers, you may want to try running:

# Delete just the UDP entries from the host's conntrack table:
docker run --net=host --privileged --rm cap10morgan/conntrack -D -p udp
# Or flush the entire conntrack table:
docker run --net=host --privileged --rm cap10morgan/conntrack -F

I have had trouble with nodes flapping, but only when cluster sizes reach 30+ nodes; we mitigated that by splitting into several smaller clusters. It doesn't sound like you have a cluster that big yet.

@wolftrouble (Author)

Thanks DJ, but as I said in my description there's no Docker involved here. I was specifically mentioning that because many of the other issues I've seen around this seem to involve using Docker, but for us it's just 'bare' EC2 classic instances.

@djenriquez

Ah sorry, I read "Docker" and my brain must've skipped the details and assumed you were using it as well.

From what I can tell, you definitely have the compute/networking to handle your POC.

Just to confirm, do you have the following ports open? (A quick spot-check is sketched after the list.)

8300 (server RPC)
8301 (Serf LAN gossip, TCP)
8301/udp (Serf LAN gossip, UDP)
8302 (Serf WAN gossip, TCP)
8302/udp (Serf WAN gossip, UDP)
8400 (CLI RPC)
8500 (HTTP API)
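
A minimal spot-check from one of the clients (a sketch - consul-server-1 is a placeholder for a real server address, and nc's UDP behavior varies between netcat builds):

# TCP ports: -z scans without sending data, -v is verbose
for p in 8300 8301 8302 8400 8500; do nc -zv consul-server-1 $p; done
# UDP ports (-u); a UDP "open" from nc is only a weak signal since UDP is connectionless
for p in 8301 8302; do nc -zvu consul-server-1 $p; done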

@wolftrouble (Author)

Yep, confirmed. And not only are those ports open - I can demonstrate that we see regular traffic on the expected ports between all the servers and the failing clients, even while those clients are supposedly failed.

Unfortunately, what I have not had time to do is pick apart the tcpdump output to see exactly what the server thinks is going on versus what traffic is actually flowing. All my testing so far has been inconclusive; typically what I see is 4-5 heartbeat back-and-forths with correct sequence numbers, mixed in with what look like occasional joins (I'm going partially by the code at https://github.com/hashicorp/memberlist/blob/master/memberlist.go, although I haven't tried to decode any of it yet, just reading hex dumps), and when clients drop out I don't see any outstanding heartbeats (which I'd expect to see).

I'll see if I can get better timelines from the dumps, but the big smoking gun I was looking for was either out-of-sequence heartbeats or missing heartbeat responses, and I can't find either. In fact, the more I pore over the network traffic, the less convinced I am that we're actually losing heartbeats - but I don't know what is actually wrong.
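
For the time-correlation pass, the rough plan (a sketch - the interface name and the 60-second window are assumptions) is to run a capture on every server plus the client under test at the same moment and compare the views offline:

# Run simultaneously on each node, then diff the pcaps in wireshark/tshark
sudo timeout 60 tcpdump -nn -i eth0 -w serf-$(hostname)-$(date +%s).pcap port 8301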

@slackpad (Contributor)

Closing this out as we've done a bunch of work under https://www.consul.io/docs/internals/gossip.html#lifeguard-enhancements to help with this, and we've seen good improvement in the wild.
