[BUGFIX] Avoid returning empty data on startup of a non-leader server #4554
Might avoid doing hashicorp/consul-template#1132
And might fix the following bugs:
* hashicorp/consul-replicate#82
* hashicorp#3975
* hashicorp/consul-template#1131
Thanks to @vaLski for helping with the tests to solve this issue.
I confirm that this PR fixes all three issues. Reproducer without this patch applied:
As soon as the patch is applied, a follower answers stale queries with a 5xx error until it has contacted the leader at least once and therefore holds a consistent Raft DB version. That's the expected behavior.
@mkeeler @banks @pearkes This fix is quite important, as the bug led to outages for us (and for @vaLski as well). It is basically a race condition in the server code: a stale request returns empty data instead of an error if a client (re)connects before the server has been able to contact its leader. Thus, Consul returns false data (for instance an empty KV; we had the same issue a long time ago, and it caused a major outage because a restarting server did return ...). It will fix 3 issues at the same time :-)
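To make the race described above concrete, here is a minimal Go sketch of the kind of guard involved. It is not the code from this PR: the `server` type, its fields, and the `read` method are hypothetical stand-ins, and `errNoLeader` only mimics `structs.ErrNoLeader`. The idea is that a stale read is refused until the follower has had at least one contact with a leader, so that an empty local store is never served as if it were real data.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoLeader is a stand-in for structs.ErrNoLeader (assumption: the real
// error lives in Consul's structs package and is not imported here).
var errNoLeader = errors.New("No cluster leader")

// server is a hypothetical, stripped-down follower.
type server struct {
	knownLeader         bool              // a leader is currently known
	everContactedLeader bool              // at least one contact since startup
	kv                  map[string]string // local (possibly stale) data
}

// read answers a query from local state.
// Consistent reads need a current leader; stale reads only need the
// follower to have synced with a leader at least once.
func (s *server) read(key string, allowStale bool) (string, error) {
	if !allowStale && !s.knownLeader {
		return "", errNoLeader
	}
	if allowStale && !s.everContactedLeader {
		// Without this check, a freshly restarted follower would
		// return empty data with a nil error.
		return "", errNoLeader
	}
	return s.kv[key], nil
}

func main() {
	fresh := &server{kv: map[string]string{}}
	_, err := fresh.read("foo", true)
	fmt.Println("fresh follower, stale read:", err)

	synced := &server{everContactedLeader: true, kv: map[string]string{"foo": "bar"}}
	v, err := synced.read("foo", true)
	fmt.Println("synced follower, stale read:", v, err)
}
```

In this sketch a follower that has lost its leader but synced earlier still serves stale reads, which matches the behavior discussed in the review comments further down.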
@pierresouchay @vaLski thanks this looks like a great find and fix. The test seems good although I need to check it through more carefully to be sure it's not going to be potentially flaky in CI - seems OK but I'm not totally sure on a quick glance.
I'm approving this because the logic seems good but we might not merge until later in the release cycle when we have a little more time to test it ourselves thoroughly!
@freddygv can you take a look over the test code and check that it doesn't seem to rely on any timing assumptions that will cause us problems?
@banks Thank you for the quick review.
@freddygv About the flakiness: it should be OK, since I used the exact same patterns as existing tests (which are not known to be flaky), and I tested the following way:
=> 80 consecutive runs without a single failure (it usually takes around 5-6 runs to get a failure for unstable tests)
t.Fatalf("bad: %#v", out.Services)
}

if out.Services["consul"] == nil {
I don't think this will ever be nil if the prior assertion for len(out.Services) passes. Also, Services maps a service to its tags, not its ID, according to this. So the stored value should be an empty slice.
It should be ok to remove this check.
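The distinction the reviewer draws can be checked in plain Go. The snippet below is only an illustration with a hypothetical `services` map standing in for `out.Services`; it does not show what the RPC codec actually produces after decoding. It shows that a key stored with an empty slice is present and non-nil, while a missing key yields a nil slice, so the nil check tests presence rather than anything about tags.

```go
package main

import "fmt"

// lookup reports whether name is present in the map and whether the
// slice obtained by indexing is nil.
func lookup(services map[string][]string, name string) (present, isNil bool) {
	tags, ok := services[name]
	return ok, tags == nil
}

func main() {
	// Hypothetical stand-in for out.Services: service name -> tags.
	services := map[string][]string{
		"consul": {}, // registered with no tags: empty but non-nil slice
	}

	present, isNil := lookup(services, "consul")
	fmt.Println(present, isNil) // true false

	present, isNil = lookup(services, "web")
	fmt.Println(present, isNil) // false true
}
```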
DONE
os.RemoveAll(dir1)

args.AllowStale = false
// Run the query, do not wait for leader, never any contact with leader, should fail
Can you please update this comment? We have had contact with a leader; it's just that we no longer have one.
DONE
args.AllowStale = false
// Run the query, do not wait for leader, never any contact with leader, should fail
if err := msgpackrpc.CallWithCodec(codec, "Catalog.ListServices", &args, &out); err == nil || err.Error() != structs.ErrNoLeader.Error() {
t.Fatalf("expected %v but got err: %v and %v", structs.ErrNoLeader, err, out)
I spotted some flakiness here after re-running the agent/consul job in Travis ~5 times: https://travis-ci.org/hashicorp/consul/jobs/418763773
Here is the error:
catalog_endpoint_test.go:1532: expected No cluster leader but got err: <nil> and {map[consul:[]] {42 0s true }}
The last true in the output above is the value of KnownLeader, so it seems that the result of the RPC may be coming back before the heartbeat fails and the leader is removed.
Could this test be restructured so that it doesn't depend on the side effects of Leave() and Shutdown()?
DONE, I added testrpc.WaitUntilNoLeader(), a new test method, in order to solve this kind of issue.
Force-pushed from ac88638 to e069418
@freddygv In the first check, some unit tests did fail, but the failures are not related to my change: https://travis-ci.org/hashicorp/consul/jobs/419520868
Salute and big thanks to everyone involved in tracking and fixing this. Great job guys. Really \o/
Ensure that DB is properly initialized when performing stale queries
Might avoid doing hashicorp/consul-template#1132
And might fix the following bugs: