Spurious failures of cross-DC requests #5595
Thank you for the detailed bug report and reproduction. Can you summarize the state of the cluster at the end of this experiment?
For the entire duration of the test, except during step 4, all clusters are healthy, with 5 out of 5 servers up. I can describe it:
During execution of step 4, at most one cluster is operating at 4 out of 5 servers: a single server is decommissioned at line 5, and a replacement is commissioned at line 9, of … There were no network problems: the entire Docker installation ran locally, and the machine (…) …
In this log fragment, the servers …
We're pretty sure we are seeing the same thing here; we have been seeing it for a while. Some details:
Relevant log snippets from an "ops" server node:
The evidence:
Note: we are running Consul 1.0.3, and I didn't spot anything obvious in the changelogs with a fix for this. My suspicion is that if I replace the server node with a new one, this will all come good; I'll try doing so now.
@pearkes, any feedback on this...?
@mkeeler I think you might be right, actually; I was reading through PRs and spotted that one today. I'm going to try to get us upgraded to the latest stable release over the next week or so and see if the problem disappears permanently. Thanks for following up :)
Hey there, feel free to check out the community forum as well!
Hey there, this issue has been automatically closed because there hasn't been any activity for at least 90 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍
Hey there, this issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days. If you are still experiencing problems, or still have questions, feel free to open a new one 👍
Overview of the Issue
Cross-datacenter requests sometimes fail because Consul selects a decommissioned server as its target.
We suspect this leads to failed KV requests, as well as to DNS lookups being very slow.
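A quick way to check whether decommissioned servers are still present in the WAN pool is to look for members whose status is not alive in the output of consul members -wan. The Python sketch below parses that table by header name; it assumes the default whitespace-separated layout, and the sample rows are illustrative, not taken from our clusters. A failed or left entry only shows that the server is still known to the pool; it does not by itself prove that requests are routed to it.

```python
"""Sketch: list WAN-pool members that are not alive.

Assumes the text was captured from `consul members -wan` in its default
whitespace-separated table, whose header row names the columns
(Node, Address, Status, ...). The sample below is illustrative only.
"""


def non_alive_members(output: str) -> list[tuple[str, str]]:
    """Return (node, status) for every row whose Status is not 'alive'."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    node_col = header.index("Node")
    status_col = header.index("Status")
    result = []
    for row in lines[1:]:
        fields = row.split()
        if len(fields) <= max(node_col, status_col):
            continue  # skip malformed or truncated rows
        if fields[status_col] != "alive":
            result.append((fields[node_col], fields[status_col]))
    return result


if __name__ == "__main__":
    sample = """\
Node         Address          Status  Type    Build  Protocol  DC   Segment
server1.dc2  172.17.0.7:8302  alive   server  1.2.3  2         dc2  <all>
server2.dc2  172.17.0.8:8302  failed  server  1.2.3  2         dc2  <all>
"""
    print(non_alive_members(sample))  # [('server2.dc2', 'failed')]
```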
Reproduction Steps
1: create a set of Consul clusters
Using the Docker image consul:1.2.3, create 4 Consul clusters, each with 5 servers.
2: WAN join all these clusters
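Steps 1 and 2 can be sketched as a script that only builds the docker and consul join commands and prints them rather than running them. The network name consul-net and the consul-<dc>-<n> container names are hypothetical, not from our actual setup; CONSUL_BIND_INTERFACE is the official image's setting for choosing the bind interface. The sketch WAN-joins only the first server of each datacenter to dc1; joining every server would be an equally reasonable choice.

```python
"""Sketch of steps 1-2: build (but do not run) the commands for
4 datacenters x 5 servers, then WAN-join them.

Hypothetical assumptions: a user-defined Docker network "consul-net" so
containers resolve each other by name, and containers named consul-<dc>-<n>.
"""
import shlex

IMAGE = "consul:1.2.3"
NETWORK = "consul-net"  # hypothetical network name
DATACENTERS = ["dc1", "dc2", "dc3", "dc4"]
SERVERS_PER_DC = 5


def container(dc: str, n: int) -> str:
    return f"consul-{dc}-{n}"


def server_commands(dc: str) -> list[list[str]]:
    """One `docker run` per server; each retries joining its LAN peers."""
    peers = [container(dc, i) for i in range(1, SERVERS_PER_DC + 1)]
    cmds = []
    for name in peers:
        cmd = [
            "docker", "run", "-d", "--name", name, "--network", NETWORK,
            "-e", "CONSUL_BIND_INTERFACE=eth0",
            IMAGE, "agent", "-server",
            f"-bootstrap-expect={SERVERS_PER_DC}",
            f"-datacenter={dc}", f"-node={name}",
        ]
        cmd += [f"-retry-join={p}" for p in peers if p != name]
        cmds.append(cmd)
    return cmds


def wan_join_commands() -> list[list[str]]:
    """Join the first server of every other DC to the first server of dc1."""
    anchor = container(DATACENTERS[0], 1)
    return [
        ["docker", "exec", anchor, "consul", "join", "-wan", container(dc, 1)]
        for dc in DATACENTERS[1:]
    ]


if __name__ == "__main__":
    for dc in DATACENTERS:
        for cmd in server_commands(dc):
            print(shlex.join(cmd))
    for cmd in wan_join_commands():
        print(shlex.join(cmd))
```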
3: create a KV entry in each cluster, for use later
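Step 3 through the HTTP KV API might look like this: PUT /v1/kv/<key> with a dc query parameter writes the key in the named datacenter. The agent address http://localhost:8500 and the key repro/probe are assumptions; the sketch builds urllib requests without sending them (pass each one to urllib.request.urlopen to actually write).

```python
"""Sketch of step 3: one KV entry per datacenter via the HTTP API.

Hypothetical assumptions: a local agent at http://localhost:8500 and the
key "repro/probe". Requests are only built here, not sent.
"""
import urllib.request
from urllib.parse import quote

AGENT = "http://localhost:8500"
KEY = "repro/probe"


def kv_put_request(dc: str, value: str) -> urllib.request.Request:
    # PUT /v1/kv/<key>?dc=<dc> targets the named datacenter; the local agent
    # forwards the write when dc is not its own datacenter.
    url = f"{AGENT}/v1/kv/{quote(KEY)}?dc={quote(dc)}"
    return urllib.request.Request(url, data=value.encode(), method="PUT")


requests = [kv_put_request(dc, f"value-for-{dc}") for dc in ["dc1", "dc2", "dc3", "dc4"]]
for req in requests:
    print(req.get_method(), req.full_url)
```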
4: cycle through the nodes, creating lots and lots of failed nodes in the process
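The rotation in step 4 can be modelled as a schedule in which every decommission is immediately followed by its replacement, so at most one server is down at any time. This is a hypothetical sketch, not our script: it assumes each replacement gets a fresh node name (consul-<dc>-<slot>-g<generation>), which is what leaves the old node behind as a failed member, and it cycles only dc2, dc3 and dc4. Mapping each step to concrete docker or consul commands is left out.

```python
"""Sketch of step 4: a one-at-a-time server rotation schedule.

Hypothetical assumptions: each replacement gets a fresh node name
(consul-<dc>-<slot>-g<generation>), and only dc2-dc4 are cycled.
"""
from typing import Iterator

CYCLED_DCS = ["dc2", "dc3", "dc4"]
SLOTS = 5


def rotation(rounds: int) -> Iterator[tuple[str, str, str]]:
    """Yield (action, dc, node); each decommission is followed by its replacement."""
    generation = {(dc, s): 0 for dc in CYCLED_DCS for s in range(1, SLOTS + 1)}
    for _ in range(rounds):
        for dc in CYCLED_DCS:
            for slot in range(1, SLOTS + 1):
                old = f"consul-{dc}-{slot}-g{generation[(dc, slot)]}"
                generation[(dc, slot)] += 1
                new = f"consul-{dc}-{slot}-g{generation[(dc, slot)]}"
                yield ("decommission", dc, old)
                yield ("commission", dc, new)


def max_servers_down(steps) -> int:
    """Largest number of servers down at once, across all DCs."""
    down = worst = 0
    for action, _dc, _node in steps:
        down += 1 if action == "decommission" else -1
        worst = max(worst, down)
    return worst


if __name__ == "__main__":
    plan = list(rotation(rounds=2))
    print(len(plan), "steps, at most", max_servers_down(plan), "server down at a time")
```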
Note that we leave dc1 untouched, only affecting dc2, dc3, and dc4.
5: leave to simmer for 2+ days, while repeatedly querying the clusters
We collect the query log, because disk space is cheap.
(Use tmux so you can detach.)
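A minimal polling loop for step 5, assuming (hypothetically) an agent at http://localhost:8500, the key repro/probe, a 1 s interval, and a log line format of <unix-time> <dc> <http-status>. poll_once takes the fetch function and a write callback as parameters so it can be tried without a running cluster; requests that get no HTTP response at all are logged as status 0.

```python
"""Sketch of step 5: read the KV entry from every DC and log each status.

Hypothetical assumptions: agent at http://localhost:8500, key repro/probe,
log lines of the form "<unix-time> <dc> <http-status>".
"""
import time
import urllib.error
import urllib.request
from typing import Callable

AGENT = "http://localhost:8500"
KEY = "repro/probe"
DCS = ["dc1", "dc2", "dc3", "dc4"]


def http_status(url: str) -> int:
    """GET the URL and return its HTTP status (0 if no response arrived)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx status, e.g. 500
    except OSError:
        return 0  # connection refused, timeout, DNS failure, ...


def poll_once(fetch: Callable[[str], int], write: Callable[[str], None], now: float) -> None:
    """Query every DC once and write one log line per request."""
    for dc in DCS:
        status = fetch(f"{AGENT}/v1/kv/{KEY}?dc={dc}")
        write(f"{now:.0f} {dc} {status}\n")


def poll_forever(path: str = "query.log", interval: float = 1.0) -> None:
    """Append to the log until interrupted (run it inside tmux)."""
    with open(path, "a") as log:
        while True:
            poll_once(http_status, log.write, time.time())
            log.flush()
            time.sleep(interval)
```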
6: Observe spurious failure
Note: it took a couple of days for us to generate the log messages.
Have patience. <3
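To tally failures per datacenter, something like the following works, assuming each line of query.log has the hypothetical form <unix-time> <dc> <http-status>; adapt the parsing to whatever format your query loop actually writes.

```python
"""Sketch for step 6: count a given HTTP status per datacenter.

Hypothetical assumption: one request per line, "<unix-time> <dc> <http-status>".
"""
from collections import Counter
from typing import Iterable


def count_status(lines: Iterable[str], status: int = 500) -> Counter:
    """Return a Counter mapping dc -> number of lines with the given status."""
    per_dc: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip blank or malformed lines
        _ts, dc, code = fields
        if int(code) == status:
            per_dc[dc] += 1
    return per_dc
```

Usage: with open("query.log") as f: print(count_status(f))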
Also note the number of 500 HTTP statuses in query.log:
Operating system and Environment details
uname -a:
Linux z-kube-beach-avandersteldt-penguin 4.4.0-1077-aws #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
consul-1.2.3
Log Fragments
Output from step 5: