Spurious failures of cross-DC requests #5595

Closed
nahratzah opened this issue Apr 2, 2019 · 10 comments
Labels: needs-investigation (The issue described is detailed and complex.)

Comments

nahratzah commented Apr 2, 2019

Overview of the Issue

Cross-datacenter requests sometimes fail because Consul selects a decommissioned server as the target.

We suspect this leads to failed KV requests, as well as very slow DNS responses.

Reproduction Steps

1: create a set of consul clusters

Using the consul:1.2.3 Docker image, create 4 Consul clusters, each with 5 servers.

docker-compose up -d --scale consul_nodes_dc1=4 --scale consul_nodes_dc2=4 --scale consul_nodes_dc3=4 --scale consul_nodes_dc4=4 --scale consul_nodes_dc5=4
docker-compose.yml contents:

version: "2"
services:
  consul_dc1:
    image: consul:1.2.3
    command: "agent -ui -server -client 0.0.0.0 -datacenter=dc1 -bootstrap-expect=5 -join consul_dc1"
  consul_nodes_dc1:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc1 -join consul_dc1"
    depends_on:
      - consul_dc1
  consul_dc2:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc2 -join consul_dc2"
  consul_nodes_dc2:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc2 -join consul_dc2"
    depends_on:
      - consul_dc2
  consul_dc3:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc3 -join consul_dc3"
  consul_nodes_dc3:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc3 -join consul_dc3"
    depends_on:
      - consul_dc3
  consul_dc4:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc4 -join consul_dc4"
  consul_nodes_dc4:
    image: consul:1.2.3
    command: "agent -server -bootstrap-expect=5 -datacenter=dc4 -join consul_dc4"
    depends_on:
      - consul_dc4
  tribute:
    image: tribute
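
Not part of the original report, but a quick way to confirm that each DC formed a healthy 5-server cluster before continuing (the container name assumes the compose project is named docker, as in the exec commands below):

docker exec docker_consul_dc1_1 /bin/consul members
docker exec docker_consul_dc1_1 /bin/consul operator raft list-peers
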
2: WAN join all these clusters
docker exec docker_consul_dc2_1 /bin/consul join -wan consul_dc1
docker exec docker_consul_dc3_1 /bin/consul join -wan consul_dc1
docker exec docker_consul_dc4_1 /bin/consul join -wan consul_dc1
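
A sanity check that is not in the original steps: confirm the WAN pool now contains the server nodes of all four datacenters:

docker exec docker_consul_dc1_1 /bin/consul members -wan
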
3: create a KV in each cluster, for use later
for dc in dc1 dc2 dc3 dc4; do docker exec docker_consul_${dc}_1 /bin/consul kv put foobar "kv in ${dc}"; done
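
Optionally (also not in the original steps), confirm that cross-DC reads work before any cycling, e.g. from dc1 against dc2:

docker exec docker_consul_dc1_1 /bin/consul kv get -datacenter=dc2 foobar
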
4: cycle through the nodes, creating lots and lots of failed nodes in the process
for x in {1,2,3,4,5,6}{1,2,3,4,5,6,7,8,9,0} ; do ./cycle.sh ; done

Note that we leave dc1 untouched; only dc2, dc3, and dc4 are cycled.

cycle.sh contents:
#!/bin/bash

for dc in dc2 dc3 dc4; do
  let "tribute_scale_count = 1 + $(docker-compose ps | grep -c tribute)"
  docker-compose scale consul_nodes_${dc}=3
  sleep 10
  docker-compose scale tribute=${tribute_scale_count}
  sleep 10
  docker-compose scale consul_nodes_${dc}=4
  sleep 10
done
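
Not in the original report: while step 4 runs, the failed WAN entries can be watched accumulating from dc1 (the grep simply counts lines whose status column reads failed):

watch -n 30 'docker exec docker_consul_dc1_1 /bin/consul members -wan | grep -c failed'
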
5: Leave to simmer for 2+ days, while repeatedly querying the clusters

We collect the query log, because disk space is cheap.
(Use tmux so you can detach.)

while sleep 5; do ./query.sh ; done 2>&1 | tee query.log
query.sh contents:
#!/bin/bash
for origin_dc in dc1 dc2 dc3 dc4 ; do
  for target_dc in dc1 dc2 dc3 dc4 dc5; do
    docker exec docker_consul_${origin_dc}_1 /bin/consul kv get -datacenter=${target_dc} foobar
  done
done
6: Observe spurious failure
docker-compose logs | grep 'rpc error: lead thread didn'\''t get connection'

Note: it took a couple of days for us to generate these log messages.
Have patience. <3

Also note the number of 500 HTTP statuses in query.log:

grep 500 query.log
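
A rough follow-up, not part of the original report, to extract which server addresses the failed RPCs point at so they can be matched against the decommissioned containers:

docker-compose logs | grep 'rpc error: lead thread didn'\''t get connection' | grep -o 'RPC failed to server [0-9.]*' | sort | uniq -c | sort -rn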

Operating system and Environment details

uname -a: Linux z-kube-beach-avandersteldt-penguin 4.4.0-1077-aws #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

consul-1.2.3

Log Fragments

Output from step 5:

consul_dc4_1        |     2019/03/29 14:12:45 [ERR] consul: RPC failed to server 172.19.0.33:8300 in DC "dc2": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc4_1        |     2019/03/29 14:12:45 [ERR] http: Request GET /v1/kv/foobar?dc=dc2, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:57878
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] consul: RPC failed to server 172.19.0.77:8300 in DC "dc4": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] http: Request GET /v1/kv/foobar?dc=dc4, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:52206
pearkes (Contributor) commented Apr 3, 2019

Thank you for the detailed bug report and reproduction. Can you summarize the state of the cluster at the end of this experiment? "lead thread didn't get connection" usually means there is a network or reachability issue of some kind. Consul should try a few different servers when it can't reach one, but perhaps at the end of the experiment there are none left that are healthy for some reason. It'd be good to clarify that.

pearkes added the needs-investigation and waiting-reply labels Apr 3, 2019
nahratzah (Author) commented Apr 4, 2019

For the entire duration of the test, except during step 4, the clusters are all healthy, with 5 out of 5 servers up.

I can describe it:

  1. Each consul-DC (dc1, dc2, dc3, and dc4) had 5 servers in the alive state.
  2. Each cycled consul-DC (dc2, dc3, and dc4) additionally had 60 servers in the failed state.
  3. The servers in the failed state disappeared after 72 hours had elapsed (which is why I can't simply copy/paste the output of consul members -wan for you).

During execution of step 4, at most one cluster at a time operates with 4 out of 5 servers: a single server is decommissioned at line 5 of cycle.sh (scaling consul_nodes_${dc} down to 3), and a replacement is commissioned at line 9 (scaling back up to 4).

There were no network problems: the entire Docker installation ran locally and the machine (a c4.large) was pretty much idle. Apart from the regular attempts by Serf to reconnect to the decommissioned machines, which is the expected behaviour, there was no mention of connection issues in the logs.

The "lead thread didn't get connection" messages point at decommissioned servers.

avandersteldt@z-kube-beach-avandersteldt-penguin:~/docker$ docker-compose logs | grep 'rpc error: lead thread didn'\''t get connection'                                                         
consul_dc4_1        |     2019/03/29 14:12:45 [ERR] consul: RPC failed to server 172.19.0.33:8300 in DC "dc2": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc4_1        |     2019/03/29 14:12:45 [ERR] http: Request GET /v1/kv/foobar?dc=dc2, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:57878
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] consul: RPC failed to server 172.19.0.77:8300 in DC "dc4": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
consul_dc2_1        |     2019/03/29 14:37:38 [ERR] http: Request GET /v1/kv/foobar?dc=dc4, error: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection from=127.0.0.1:52206

In this log fragment, the servers 172.19.0.33 and 172.19.0.77 were both decommissioned during step 4. These errors did not occur during the cycling, but many hours later. Yet those servers are still being used as targets for the cross-DC queries originating from dc4 and dc2 respectively.
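
For anyone reproducing this, a minimal check (assuming the default data directory of the official image, /consul/data) to see whether such a decommissioned address is still present in the WAN serf snapshot of the node reporting the error:

docker exec docker_consul_dc4_1 cat /consul/data/serf/remote.snapshot | grep 172.19.0.33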

pearkes removed the waiting-reply label Apr 11, 2019
CpuID commented Apr 17, 2019

Pretty sure we are seeing the same thing here; we have been seeing it for a while.

Some details:

  • 3 clusters with WAN federation between them
  • noticed that the "ops" cluster seems to be hanging onto an old serf entry for a "production" server node
  • the "ops" server node claims it reaps the failed "production" server node after 72 hours, but it keeps trying to hit it

Relevant log snippets from an "ops" server node:

user@ip-10-199-10-22:~# docker logs dc5310ea6d61 2>&1 | grep 10.202.0.5
    2019/03/19 21:02:38 [INFO] serf: EventMemberJoin: ip-10-202-0-5.aws-us-e1-production 10.202.0.5
    2019/03/19 21:02:38 [INFO] consul: Handled member-join event for server "ip-10-202-0-5.aws-us-e1-production" in area "wan"
    2019/03/20 00:11:09 [ERR] yamux: Failed to write header: write tcp 10.199.10.22:56998->10.202.0.5:8300: use of closed network connection
    2019/04/01 20:54:56 [INFO] memberlist: Suspect ip-10-202-0-5.aws-us-e1-production has failed, no acks received
    2019/04/01 20:55:17 [INFO] serf: EventMemberFailed: ip-10-202-0-5.aws-us-e1-production 10.202.0.5
    2019/04/01 20:55:17 [INFO] consul: Handled member-failed event for server "ip-10-202-0-5.aws-us-e1-production" in area "wan"
    2019/04/01 21:08:58 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/01 21:25:58 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
...
    2019/04/02 05:14:20 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/02 05:14:59 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/02 05:26:29 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/02 05:45:31 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/02 05:45:31 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
...
    2019/04/04 20:15:41 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:17:11 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:31:52 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/04 20:31:52 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/04 20:40:12 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:41:12 [INFO] serf: attempting reconnect to ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
    2019/04/04 20:53:37 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/04 20:55:18 [INFO] serf: EventMemberReap: ip-10-202-0-5.aws-us-e1-production
    2019/04/05 02:19:33 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/05 02:19:33 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/05 03:24:32 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/05 04:05:44 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
...
    2019/04/17 20:59:45 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/17 23:04:44 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    2019/04/17 23:04:44 [ERR] consul: RPC failed to server 10.202.0.5:8300 in DC "aws-us-e1-production": rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection

The evidence:

user@ip-10-199-10-22:~# docker exec -it dc5310ea6d61 cat /consul/data/serf/remote.snapshot | wc -l
67
user@ip-10-199-10-22:~# docker exec -it dc5310ea6d61 cat /consul/data/serf/remote.snapshot | grep 10.202.0.5
alive: ip-10-202-0-5.aws-us-e1-production 10.202.0.5:8304
not-alive: ip-10-202-0-5.aws-us-e1-production
user@ip-10-199-10-22:~# date
Wed Apr 17 23:53:49 UTC 2019

CpuID commented Apr 17, 2019

Note: we are running Consul 1.0.3; I didn't spot anything obvious in the changelogs that fixes this. My suspicion is that if I replace the server node with a new one, this will all come good, and I'll try doing so now.

CpuID commented Jul 4, 2019

@pearkes any feedback on this...?

mkeeler (Member) commented Jul 4, 2019

@CpuID I believe this issue was fixed by #5317. That was in v1.4.3. I haven't verified yet with the reproduction in this issue, though, so I will leave it open until then.
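
A trivial check, not from the thread, to confirm which version a running agent is actually on before and after upgrading (the container name is a placeholder):

docker exec <consul_container> consul version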

CpuID commented Jul 4, 2019

@mkeeler I think you might be right, actually; I was reading through PRs and spotted that one today. I'm going to try to get us upgraded to the latest stable over the next week or so and see if the problem disappears permanently. Thanks for following up :)

stale bot commented Oct 21, 2019

Hey there,
We wanted to check in on this request since it has been inactive for at least 60 days.
If you think this is still an important issue in the latest version of Consul
or its documentation, please reply with a comment here, which will cause it to stay open for investigation.
If there is still no activity on this issue for 30 more days, we will go ahead and close it.

Feel free to check out the community forum as well!
Thank you!

stale bot added the waiting-reply label Oct 21, 2019
stale bot commented Nov 20, 2019

Hey there,
This issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍

stale bot closed this as completed Nov 20, 2019
ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

ghost locked and limited conversation to collaborators Jan 25, 2020
ghost removed the waiting-reply label Jan 25, 2020