[BUGFIX] Avoid returning empty data on startup of a non-leader server #4554

pierresouchay · 2018-08-21T13:24:42Z

Ensure that DB is properly initialized when performing stale queries

Might avoid doing hashicorp/consul-template#1132

And might fix the following bugs:

consul-replicate sync empty data during raft leader election in the master DC consul-replicate#82
Flapping consul servers emit empty data for keyprefix watches leading to KV data loss when used with consul-replicate #3975
consul-template render empty data during consul server raft election/change consul-template#1131


pierresouchay · 2018-08-21T13:47:56Z

Thanks to @vaLski for helping with the tests to solve this issue.

vaLski · 2018-08-21T13:55:11Z

I confirm that this PR fixes all three issues.

Reproducer without this patch applied:

create consul server leader
create consul server follower
write some data into the kv store

  for i in $(seq 1 10); do
    consul kv put pub/prefix/${i}
  done

initiate stale reads on the follower

  while true; do
    consul kv get -recurse -detailed  -stale pub/prefix/
    echo $?
    date +%s
    sleep 0.1
  done

go to the consul leader and block all incoming/outgoing traffic from/to the follower

  iptables -I INPUT -s follower.ip.ip.ip -j DROP

go to the follower, kill consul and start it again
shortly afterwards, you will see the follower answering recurse queries with a valid response code and an empty payload
this behavior is unexpected: it should either return 5xx or return non-empty data

With the patch applied, the follower answers stale queries with a 5xx error until it has contacted the leader at least once and thus has a consistent raft DB version. That's the expected behavior.

pierresouchay · 2018-08-21T16:49:49Z

@mkeeler @banks @pearkes This fix is quite important, as the bug led to outages for us (and for @vaLski as well).

This is basically a race condition in the server code that makes a stale request return empty data instead of an error if a client (re)connects before the server has been able to contact its leader. Thus, Consul returns false data (for instance an empty KV; we also had the same issue a long time ago, which caused a major outage because a restarting server returned [] for the list of nodes).

It will fix 3 issues at the same time :-)

banks

@pierresouchay @vaLski thanks this looks like a great find and fix. The test seems good although I need to check it through more carefully to be sure it's not going to be potentially flaky in CI - seems OK but I'm not totally sure on a quick glance.

I'm approving this because the logic seems good but we might not merge until later in the release cycle when we have a little more time to test it ourselves thoroughly!

@freddygv can you take a look over the test code and check that it doesn't seem to rely on any timing assumptions that will cause us problems?

pierresouchay · 2018-08-22T12:41:31Z

@banks Thank you for the quick review

@freddygv About the flakiness, it should be OK since I used the exact same patterns as existing tests (which are not known to be flaky), and I tested it the following way:

while go test -parallel 2 -timeout 30s github.com/hashicorp/consul/agent/consul -run '^TestCatalog_ListService(s_Stale|Nodes)$'; do go clean -testcache; done

=> 80 consecutive runs without a single failure (it usually takes around 5-6 runs to get a failure for unstable tests)

freddygv · 2018-08-22T14:56:00Z

agent/consul/catalog_endpoint_test.go

+		t.Fatalf("bad: %#v", out.Services)
+	}
+
+	if out.Services["consul"] == nil {


I don't think this will ever be nil if the prior assertion for len(out.Services) passes. Also, Services maps a service to its tags, not its ID, according to this. So the stored value should be an empty slice.

It should be ok to remove this check.
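As a side note on the nil check discussed above, here is a small sketch of plain Go map and slice semantics. It uses hypothetical data shaped like a service-name to tags map and says nothing about how the RPC layer actually decodes the response; `hasNilTags` is a name made up for this example.

```go
package main

import "fmt"

// hasNilTags reports whether the value stored under name is a nil slice.
// A missing key also yields nil, because indexing a map returns the zero value.
func hasNilTags(services map[string][]string, name string) bool {
	return services[name] == nil
}

func main() {
	// Hypothetical data: one service with an empty, non-nil tag list.
	services := map[string][]string{"consul": {}}

	fmt.Println(len(services))                   // number of entries
	fmt.Println(hasNilTags(services, "consul"))  // empty but non-nil slice
	fmt.Println(hasNilTags(services, "missing")) // absent key

	// Both an empty slice and a nil slice format as [] with %v, so printed
	// output alone cannot tell them apart.
	fmt.Printf("%v %v\n", services["consul"], services["missing"])
}
```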

freddygv · 2018-08-22T15:43:12Z

agent/consul/catalog_endpoint_test.go

+	os.RemoveAll(dir1)
+
+	args.AllowStale = false
+	// Run the query, do not wait for leader, never any contact with leader, should fail


Can you please update this comment? We have had contact with a leader; it's just that now we don't have one anymore.

freddygv · 2018-08-22T18:52:00Z

agent/consul/catalog_endpoint_test.go

+	args.AllowStale = false
+	// Run the query, do not wait for leader, never any contact with leader, should fail
+	if err := msgpackrpc.CallWithCodec(codec, "Catalog.ListServices", &args, &out); err == nil || err.Error() != structs.ErrNoLeader.Error() {
+		t.Fatalf("expected %v but got err: %v and %v", structs.ErrNoLeader, err, out)


I spotted some flakiness here after re-running the agent/consul job in Travis ~5 times: https://travis-ci.org/hashicorp/consul/jobs/418763773

Here is the error:
catalog_endpoint_test.go:1532: expected No cluster leader but got err: <nil> and {map[consul:[]] {42 0s true }}

The last true in the slice above is the value of KnownLeader, so it seems that the result for the RPC may be coming back before the heartbeat fails and the leader is removed.

Could this test be restructured so that it doesn't depend on the side-effects of Leave() and Shutdown()?

DONE, I added a new test helper, testrpc.WaitUntilNoLeader(), in order to solve this kind of issue.

pierresouchay · 2018-08-23T09:03:27Z

@freddygv In the first check, some unit tests failed, but the failures are not related to my change: https://travis-ci.org/hashicorp/consul/jobs/419520868

vaLski · 2018-08-23T21:20:05Z

Salute and big thanks to everyone involved in tracking and fixing this. Great job guys. Really \o/

pierresouchay added 2 commits August 21, 2018 14:17

Ensure that DB is properly initialized when performing stale queries

fb1d3ec

Might avoid doing hashicorp/consul-template#1132 And might fix the following bugs: * hashicorp/consul-replicate#82 * hashicorp#3975 * hashicorp/consul-template#1131

Added Unit test + ensure that index is Applied

81c4335

pierresouchay changed the title ~~Stale reads on server init~~ Avoid Stale reads to return nothing without error on server follower init Aug 21, 2018

Use LastContact() to ensure we already discussed at least once with the leader

53693fe

Fixed stale test case to test fully stale requests with bug fixed

baf8ac4

pierresouchay changed the title ~~Avoid Stale reads to return nothing without error on server follower init~~ [BUGFIX] Avoid returning empty data on startup of a non-leader server Aug 21, 2018

banks approved these changes Aug 22, 2018

freddygv suggested changes Aug 22, 2018

More stable Unit Test / fixes according to @freddygv suggestions

e069418

pierresouchay force-pushed the stale_reads_on_server_init branch from ac88638 to e069418 Compare August 23, 2018 08:31

freddygv approved these changes Aug 23, 2018

freddygv merged commit b898131 into hashicorp:master Aug 23, 2018

This was referenced Aug 29, 2018

When upgrading to consul 1.2.2, agents run consul watch handlers #4610

Closed

consul watch handler runs on consul reload / restart #4609

Closed

alkalinecoffee mentioned this pull request Jul 22, 2020

Consul Upgrade with Replicate Results in Missing KVs #8351

Closed
