
Flapping instances in EC2 #1335

Closed
wolftrouble opened this issue Oct 25, 2015 · 5 comments

Comments

@wolftrouble

Hi there, I've seen a few threads around this but nothing that addresses our situation.

In our case we have a 3-server cluster running in EC2 classic, with about 5 clients so far (this is all a POC). The clients are all low-utilization pre-prod boxes, so no CPU exhaustion or memory pressure problems. The servers are all dedicated m3.medium instances. There is no containerization involved.

The behavior we're seeing is what other people have reported: clients dropping and re-joining constantly, typically every few seconds, with no indication the clients are suffering or having problems. Servers do NOT fail and re-join - only clients are affected.

All security groups appear to be correct, as evidenced by plenty of UDP traffic back and forth on the serf ports (specifically 8301).
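
(For reference, the sort of check I'm basing that on - a minimal sketch, where eth0 is an assumption and should be swapped for the instance's actual interface:)

# Watch Serf LAN gossip (TCP and UDP) to and from this node
sudo tcpdump -nn -i eth0 port 8301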

I spent today doing some tcpdumping to try to find anything interesting, and about all I can find is that nothing obviously interesting is going on - I can't find evidence of dropped packets, and parsing the sequence numbers turns up no missed acks that I can find. Also interesting: I'm seeing nodes that have supposedly been dropped from the cluster (going by one of the servers' logs) still being used for indirect probes even while showing as a failed member - although again, this is from the view of one server, and I suspect that if I time-correlated checks on other servers, I'd find not all of them believe my client under test is dead.

I'm really at a loss to explain what's happening. All the suggestions I can find for people who've had this problem relate either to Docker or to an improper SG setup (specifically, not allowing UDP through), but I see plenty of (expected) UDP traffic between clients and servers. We have other clusters not running in EC2 that don't demonstrate this problem. I'm running out of things to try - any suggestions?

@djenriquez

Hi @wolftrouble, have you seen issue #1212?

Docker has a known issue with its connection-tracking (conntrack) table when agents stop and start quickly - stale conntrack entries cause UDP packets to be dropped.

If you do recall stopping and starting the Consul containers, you may want to try running:

# Delete just the UDP entries from the host's conntrack table:
docker run --net=host --privileged --rm cap10morgan/conntrack -D -p udp
# Or flush the entire conntrack table:
docker run --net=host --privileged --rm cap10morgan/conntrack -F

I have had trouble with nodes flapping, but only when cluster sizes reach 30+ nodes; we mitigated that by splitting into several smaller clusters. It doesn't sound like you have a cluster that big yet.

@wolftrouble (Author)

Thanks DJ, but as I said in my description there's no Docker involved here. I was specifically mentioning that because many of the other issues I've seen around this seem to involve using Docker, but for us it's just 'bare' EC2 classic instances.

@djenriquez

Ah sorry, I read "Docker" and my brain must've skipped the details and assumed you were using it as well.

From what I can tell, you definitely have the compute/networking to handle your POC.

Just to confirm, do you have the following ports open? (A quick spot-check is sketched after the list.)

8300 (server RPC)
8301 (Serf LAN gossip, TCP)
8301/udp (Serf LAN gossip, UDP)
8302 (Serf WAN gossip, TCP)
8302/udp (Serf WAN gossip, UDP)
8400 (CLI RPC)
8500 (HTTP API)
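
A minimal spot-check from one of the clients (a sketch - consul-server-1 is a placeholder for a real server address, and nc's UDP behavior varies between netcat builds):

# TCP ports: -z scans without sending data, -v is verbose
for p in 8300 8301 8302 8400 8500; do nc -zv consul-server-1 $p; done
# UDP ports (-u); a UDP "open" from nc is only a weak signal since UDP is connectionless
for p in 8301 8302; do nc -zvu consul-server-1 $p; done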

@wolftrouble (Author)

Yep, confirmed. And not only are those ports open - I can demonstrate that we see regular traffic on the expected ports between all the servers and the failing clients, even while those clients are supposedly failed.

Unfortunately, what I have not had time to do is pick apart the tcpdump output to see exactly what the server thinks is going on versus what traffic is actually flowing. All my testing so far has been inconclusive; typically what I see is 4-5 heartbeat back-and-forths with correct sequence numbers, mixed in with what look like occasional joins (I'm going partially by the code at https://github.com/hashicorp/memberlist/blob/master/memberlist.go, although I haven't tried to decode any of it yet, just reading hex dumps), and when clients drop out I don't see any outstanding heartbeats (which I'd expect to see).

I'll see if I can get better timelines from the dumps, but the big smoking gun I was looking for was either out-of-sequence heartbeats or missing heartbeat responses, and I can't find either. In fact, the more I pore over the network traffic, the less convinced I am that we're actually losing heartbeats - but I don't know what is actually wrong.
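
For the time-correlation pass, the rough plan (a sketch - the interface name and the 60-second window are assumptions) is to run a capture on every server plus the client under test at the same moment and compare the views offline:

# Run simultaneously on each node, then diff the pcaps in wireshark/tshark
sudo timeout 60 tcpdump -nn -i eth0 -w serf-$(hostname)-$(date +%s).pcap port 8301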

@slackpad (Contributor)

Closing this out as we've done a bunch of work under https://www.consul.io/docs/internals/gossip.html#lifeguard-enhancements to help with this, and we've seen good improvement in the wild.
