Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky test: TestAgent_ReloadConfigAndKeepChecksStatus #7425

Closed
rboyer opened this issue Mar 10, 2020 · 2 comments · Fixed by #7490
Closed

flaky test: TestAgent_ReloadConfigAndKeepChecksStatus #7425

rboyer opened this issue Mar 10, 2020 · 2 comments · Fixed by #7490
Labels
theme/testing Testing, and related enhancements

Comments

@rboyer
Copy link
Member

rboyer commented Mar 10, 2020

https://circleci.com/gh/hashicorp/consul/139535

--- FAIL: TestAgent_ReloadConfigAndKeepChecksStatus (7.23s)
    writer.go:29: 2020-03-10T15:42:21.307Z [WARN]  TestAgent_ReloadConfigAndKeepChecksStatus: bootstrap = true: do not enable unless necessary
    writer.go:29: 2020-03-10T15:42:21.308Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:8ad7be10-d3a2-41df-f7f1-1343955446ad Address:127.0.0.1:10113}]"
    writer.go:29: 2020-03-10T15:42:21.308Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: entering follower state: follower="Node at 127.0.0.1:10113 [Follower]" leader=
    writer.go:29: 2020-03-10T15:42:21.309Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.serf.wan: serf: EventMemberJoin: Node-8ad7be10-d3a2-41df-f7f1-1343955446ad.dc1 127.0.0.1
    writer.go:29: 2020-03-10T15:42:21.309Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.serf.lan: serf: EventMemberJoin: Node-8ad7be10-d3a2-41df-f7f1-1343955446ad 127.0.0.1
    writer.go:29: 2020-03-10T15:42:21.309Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: Handled event for server in area: event=member-join server=Node-8ad7be10-d3a2-41df-f7f1-1343955446ad.dc1 area=wan
    writer.go:29: 2020-03-10T15:42:21.309Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: Adding LAN server: server="Node-8ad7be10-d3a2-41df-f7f1-1343955446ad (Addr: tcp/127.0.0.1:10113) (DC: dc1)"
    writer.go:29: 2020-03-10T15:42:21.310Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Started DNS server: address=127.0.0.1:10122 network=udp
    writer.go:29: 2020-03-10T15:42:21.310Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Started DNS server: address=127.0.0.1:10122 network=tcp
    writer.go:29: 2020-03-10T15:42:21.310Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Started HTTP server: address=127.0.0.1:10123 network=tcp
    writer.go:29: 2020-03-10T15:42:21.310Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: started state syncer
    writer.go:29: 2020-03-10T15:42:21.310Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Started gRPC server: address=127.0.0.1:10114 network=tcp
    writer.go:29: 2020-03-10T15:42:21.360Z [WARN]  TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: heartbeat timeout reached, starting election: last-leader=
    writer.go:29: 2020-03-10T15:42:21.360Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: entering candidate state: node="Node at 127.0.0.1:10113 [Candidate]" term=2
    writer.go:29: 2020-03-10T15:42:21.360Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: votes: needed=1
    writer.go:29: 2020-03-10T15:42:21.360Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: vote granted: from=8ad7be10-d3a2-41df-f7f1-1343955446ad term=2 tally=1
    writer.go:29: 2020-03-10T15:42:21.360Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: election won: tally=1
    writer.go:29: 2020-03-10T15:42:21.360Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.raft: entering leader state: leader="Node at 127.0.0.1:10113 [Leader]"
    writer.go:29: 2020-03-10T15:42:21.360Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: cluster leadership acquired
    writer.go:29: 2020-03-10T15:42:21.360Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: New leader elected: payload=Node-8ad7be10-d3a2-41df-f7f1-1343955446ad
    writer.go:29: 2020-03-10T15:42:21.361Z [DEBUG] connect.ca.consul: consul CA provider configured: id=07:80:c8:de:f6:41:86:29:8f:9c:b8:17:d6:48:c2:d5:c5:5c:7f:0c:03:f7:cf:97:5a:a7:c1:68:aa:23:ae:81 is_primary=true
    writer.go:29: 2020-03-10T15:42:21.362Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.connect: initialized primary datacenter CA with provider: provider=consul
    writer.go:29: 2020-03-10T15:42:21.362Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.leader: started routine: routine="federation state anti-entropy"
    writer.go:29: 2020-03-10T15:42:21.362Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.leader: started routine: routine="federation state pruning"
    writer.go:29: 2020-03-10T15:42:21.362Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.leader: started routine: routine="CA root pruning"
    writer.go:29: 2020-03-10T15:42:21.362Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.server: Skipping self join check for node since the cluster is too small: node=Node-8ad7be10-d3a2-41df-f7f1-1343955446ad
    writer.go:29: 2020-03-10T15:42:21.362Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: member joined, marking health alive: member=Node-8ad7be10-d3a2-41df-f7f1-1343955446ad
    writer.go:29: 2020-03-10T15:42:21.362Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: federation state anti-entropy synced
    writer.go:29: 2020-03-10T15:42:21.363Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Skipping remote check since it is managed automatically: check=serfHealth
    writer.go:29: 2020-03-10T15:42:21.364Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Synced node info
    writer.go:29: 2020-03-10T15:42:21.364Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Synced service: service=webserver1
    writer.go:29: 2020-03-10T15:42:21.364Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check in sync: check=service:webserver1
    writer.go:29: 2020-03-10T15:42:21.364Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Node info in sync
    writer.go:29: 2020-03-10T15:42:21.364Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Service in sync: service=webserver1
    writer.go:29: 2020-03-10T15:42:21.364Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check in sync: check=service:webserver1
    writer.go:29: 2020-03-10T15:42:21.414Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:22.414Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:22.708Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Skipping remote check since it is managed automatically: check=serfHealth
    writer.go:29: 2020-03-10T15:42:22.708Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Node info in sync
    writer.go:29: 2020-03-10T15:42:22.708Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Service in sync: service=webserver1
    writer.go:29: 2020-03-10T15:42:22.708Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Synced check: check=service:webserver1
    writer.go:29: 2020-03-10T15:42:22.708Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Node info in sync
    writer.go:29: 2020-03-10T15:42:22.708Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Service in sync: service=webserver1
    writer.go:29: 2020-03-10T15:42:22.708Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check in sync: check=service:webserver1
    writer.go:29: 2020-03-10T15:42:23.415Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:24.416Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:25.417Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:26.418Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:27.419Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    writer.go:29: 2020-03-10T15:42:28.420Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus: Check status updated: check=service:webserver1 status=passing
    retry.go:121: assertions.go:261: 
        	Error Trace:	agent_test.go:3463
        	            				retry.go:130
        	            				asm_amd64.s:1357
        	Error:      	Not equal: 
        	            	expected: "critical"
        	            	actual  : "passing"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-critical
        	            	+passing
        	Messages:   	check {"service:webserver1" {}} is wrong
        
        
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Requesting shutdown
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server: shutting down server
    writer.go:29: 2020-03-10T15:42:28.532Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.leader: stopping routine: routine="federation state anti-entropy"
    writer.go:29: 2020-03-10T15:42:28.532Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.leader: stopping routine: routine="federation state pruning"
    writer.go:29: 2020-03-10T15:42:28.532Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.leader: stopping routine: routine="CA root pruning"
    writer.go:29: 2020-03-10T15:42:28.532Z [WARN]  TestAgent_ReloadConfigAndKeepChecksStatus.server.serf.lan: serf: Shutdown without a Leave
    writer.go:29: 2020-03-10T15:42:28.532Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.leader: stopped routine: routine="federation state pruning"
    writer.go:29: 2020-03-10T15:42:28.532Z [DEBUG] TestAgent_ReloadConfigAndKeepChecksStatus.leader: stopped routine: routine="CA root pruning"
    writer.go:29: 2020-03-10T15:42:28.532Z [WARN]  TestAgent_ReloadConfigAndKeepChecksStatus.server.serf.wan: serf: Shutdown without a Leave
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus.server.router.manager: shutting down
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: consul server down
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: shutdown complete
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Stopping server: protocol=DNS address=127.0.0.1:10122 network=tcp
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Stopping server: protocol=DNS address=127.0.0.1:10122 network=udp
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Stopping server: protocol=HTTP address=127.0.0.1:10123 network=tcp
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Waiting for endpoints to shut down
    writer.go:29: 2020-03-10T15:42:28.532Z [INFO]  TestAgent_ReloadConfigAndKeepChecksStatus: Endpoints down

I have a suspicion that #7345 restored a racy behavior that had previously been fixed, while it was fixing a different thing that regressed.

@rboyer rboyer added the theme/testing Testing, and related enhancements label Mar 10, 2020
@rboyer
Copy link
Member Author

rboyer commented Mar 10, 2020

Related: #7042

pierresouchay added a commit to pierresouchay/consul that referenced this issue Mar 10, 2020
On my machine, with some load, TestAgent_ReloadConfigAndKeepChecksStatus is failing 1/3 of time
because interval is low (1s). Remove the first test for critical as it is tested elsewhere.

Setting the interval to 3s of interval also fixes the issue but slow down all tests

This will avoid test to be unstable and fix hashicorp#7425.

test protocal on my machine to ensure it is not a bigger problem:

```shell
while true
do
  go clean -testcache
  go test -timeout 30s github.com/hashicorp/consul/agent -run '^TestAgent_ReloadConfigAndKeepChecksStatus$'
done
```

=> with 1s interval => fails 1/3 of time
=> with 3s interval => never fails

go test -timeout 30s github.com/hashicorp/consul/agent -run ^(TestAgent_ReloadConfigAndKeepChecksStatus)$
@pierresouchay
Copy link
Contributor

@rboyer to me, this is just a timing issue, I did put a fix in #7426 (with how I tested it)
Setting the Interval to 3s => not any error, the initial test was too agressive (and not really useful), so removed the check for Critical instead and kept 1s to avoid having too long run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/testing Testing, and related enhancements
Projects
None yet
2 participants