Flaky test: TestAPISwarmNodeDrainPause #23516

thaJeztah opened this issue Jun 14, 2016 · 13 comments
@thaJeztah
Member

This test was recently added but appears to be flaky. I think @aluzzardi is already working on a fix in SwarmKit:

https://jenkins.dockerproject.org/job/Docker%20Master/label=ubuntu%20&&%20aufs/5741/console

08:55:20 ----------------------------------------------------------------------
08:55:20 FAIL: docker_api_swarm_test.go:409: DockerSwarmSuite.TestApiSwarmNodeDrainPause
08:55:20 
08:55:20 [d27978813] waiting for daemon to start
08:55:20 [d27978813] daemon started
08:55:20 [d61468826] waiting for daemon to start
08:55:20 [d61468826] daemon started
08:55:20 docker_api_swarm_test.go:459:
08:55:20     waitAndAssert(c, defaultReconciliationTimeout, d1.checkActiveContainerCount, checker.Equals, instances-d2ContainerCount)
08:55:20 github.com/docker/docker/integration-cli/_test/_obj_test/docker_utils.go:2011:
08:55:20     ...open github.com/docker/docker/integration-cli/_test/_obj_test/docker_utils.go: no such file or directory
08:55:20 ... obtained int = 5
08:55:20 ... expected int = 12
08:55:20 ... output: "22b355e976cc\na53c75765516\nc9f1393e2298\n10a9c4df2d02\ndcb306eba796\n"
08:55:20 
08:55:20 [d27978813] exiting daemon
08:55:20 [d61468826] exiting daemon
08:55:21 
08:55:21 ----------------------------------------------------------------------
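
The failing assertion above is a waitAndAssert call: the suite polls a counter (here the number of active containers reported by a daemon) until the condition holds or the reconciliation timeout expires, then fails with the last observed value. Below is a minimal sketch of that polling pattern in Go; the helper name waitFor and the poll interval are illustrative, not the actual integration-cli helper.

package main

import (
	"fmt"
	"time"
)

// waitFor polls check every 100ms until it returns the expected value or the
// timeout expires. It returns the last observed value and whether it matched.
// This only sketches the pattern behind the suite's waitAndAssert helper.
func waitFor(timeout time.Duration, expected int, check func() int) (int, bool) {
	deadline := time.Now().Add(timeout)
	for {
		got := check()
		if got == expected {
			return got, true
		}
		if time.Now().After(deadline) {
			return got, false
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	// Example: the checked value converges to the expected count after a few polls.
	count := 5
	got, ok := waitFor(5*time.Second, 12, func() int {
		if count < 12 {
			count++
		}
		return count
	})
	fmt.Printf("obtained int = %d, matched = %v\n", got, ok)
}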
thaJeztah added this to the 1.12.0 milestone Jun 14, 2016
@thaJeztah
Member Author

The test was temporarily disabled in #23526.

@tophj-ibm
Contributor

tophj-ibm commented Jun 16, 2016

Looks like ppc64le is still seeing failures on this test. The skip seems to be near the end of the test.
Edit: this failure appears to be fixed.

----------------------------------------------------------------------
FAIL: docker_api_swarm_test.go:409: DockerSwarmSuite.TestApiSwarmNodeDrainPause

[d89110000] waiting for daemon to start
[d89110000] daemon started
docker_api_swarm_test.go:410:
   d1 := s.AddDaemon(c, true, true)
check_test.go:221:
   c.Assert(d.Init(aa, ""), check.IsNil)
... value *errors.errorString = &errors.errorString{s:"initializing swarm: invalid statuscode 406, \"{\\\"message\\\":\\\"this node is already part of a Swarm\\\"}\\n\""} ("initializing swarm: invalid statuscode 406, \"{\\\"message\\\":\\\"this node is already part of a Swarm\\\"}\\n\"")

----------------------------------------------------------------------

@tophj-ibm
Contributor

Looks like this was supposed to be fixed via b38408f, but it's now failing consistently on Power for the original reason.

@michael-holzheu
Contributor

Also, on s390x, six DockerSwarm test cases fail:

# cat bundles/latest/test-integration-cli/test.log | grep FAIL
FAIL: docker_api_swarm_test.go:533: TestApiSwarmLeaveRemovesContainer.pN57_github_com_docker_docker_integration_cli.DockerSwarmSuite
FAIL: docker_api_swarm_test.go:479: TestApiSwarmNodeDrainPause.pN57_github_com_docker_docker_integration_cli.DockerSwarmSuite
FAIL: docker_api_swarm_test.go:635: TestApiSwarmScaleNoRollingUpdate.pN57_github_com_docker_docker_integration_cli.DockerSwarmSuite
FAIL: docker_api_swarm_test.go:282: TestApiSwarmServicesCreate.pN57_github_com_docker_docker_integration_cli.DockerSwarmSuite
FAIL: docker_api_swarm_test.go:298: TestApiSwarmServicesMultipleAgents.pN57_github_com_docker_docker_integration_cli.DockerSwarmSuite
FAIL: docker_api_swarm_test.go:344: TestApiSwarmServicesStateReporting.pN57_github_com_docker_docker_integration_cli.DockerSwarmSuite
OOPS: 16 passed, 6 FAILED

Interestingly, TestApiSwarmServicesCreateGlobal passes with gccgo and fails with golang.

@michael-holzheu
Contributor

FYI: With commit 1bbac42 and go1.7 beta 2 on s390x:

FAIL: docker_api_swarm_test.go:533: DockerSwarmSuite.TestApiSwarmLeaveRemovesContainer
FAIL: docker_api_swarm_test.go:606: DockerSwarmSuite.TestApiSwarmManagerRestore
FAIL: docker_api_swarm_test.go:479: DockerSwarmSuite.TestApiSwarmNodeDrainPause
FAIL: docker_api_swarm_test.go:635: DockerSwarmSuite.TestApiSwarmScaleNoRollingUpdate
FAIL: docker_api_swarm_test.go:282: DockerSwarmSuite.TestApiSwarmServicesCreate
FAIL: docker_api_swarm_test.go:298: DockerSwarmSuite.TestApiSwarmServicesMultipleAgents
FAIL: docker_api_swarm_test.go:344: DockerSwarmSuite.TestApiSwarmServicesStateReporting

2016-06-21-1bbac428de015a24a0724a9d0990d500c445dcad-test-k4.5-2cpu-go1.7b2.txt

@tophj-ibm
Contributor

So the issue is that even though the daemon has an arch-specific busybox cached, the swarm daemon automatically pulls down another one. While not intended to fix this issue, #23790 seems to solve the problem.

@michael-holzheu can you see if that PR fixes the issue? Alternatively, you can temporarily change https://github.com/docker/docker/blob/master/integration-cli/docker_api_swarm_test.go#L665 to s390x/busybox:latest and that should do the trick as well.
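
A sketch of that workaround is shown below: pin the service's image to the arch-specific busybox tag so the swarm daemon does not pull a different-arch image over the one already cached. The helper name and the spec layout (current moby api/types/swarm field names) are assumptions for illustration, not the test's actual code.

package main

import (
	"fmt"

	"github.com/docker/docker/api/types/swarm"
)

// archSpecificServiceSpec builds a minimal replicated-service spec whose image
// is pinned to an arch-specific busybox tag (e.g. "s390x/busybox:latest")
// instead of the multi-arch "busybox:latest".
func archSpecificServiceSpec(image string, replicas uint64) swarm.ServiceSpec {
	return swarm.ServiceSpec{
		Annotations: swarm.Annotations{Name: "top"},
		TaskTemplate: swarm.TaskSpec{
			ContainerSpec: &swarm.ContainerSpec{
				Image:   image,
				Command: []string{"/bin/top"},
			},
		},
		Mode: swarm.ServiceMode{
			Replicated: &swarm.ReplicatedService{Replicas: &replicas},
		},
	}
}

func main() {
	spec := archSpecificServiceSpec("s390x/busybox:latest", 4)
	fmt.Println(spec.TaskTemplate.ContainerSpec.Image)
}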

@michael-holzheu
Contributor

@michael-holzheu can you see if that PR fixes the issue?

@tophj-ibm I can confirm that PR #23790 also fixes the issue on s390x.
Thanks for investigating!

@clnperez
Contributor

@thaJeztah Can we rename this issue since the test is no longer just flaky, but fails consistently? Or should we create another issue for the failing tests?

@clnperez
Contributor

@thaJeztah -- So, never mind. I misunderstood the conversation in PR #23790. That PR was merged (thanks @tonistiigi and @vdemeester), and all 6 tests now pass for Power, z, and I'm assuming ARM as well. 🎉

@olljanat
Contributor

olljanat commented Jan 9, 2019

This happened on Z in #38500 too:

06:14:29 FAIL: docker_api_swarm_node_test.go:75: DockerSwarmSuite.TestAPISwarmNodeDrainPause
06:14:29 
06:14:29 [d9256ae006cfc] waiting for daemon to start
06:14:29 [d9256ae006cfc] daemon started
06:14:29 
06:14:29 [d43aaec4277db] waiting for daemon to start
06:14:29 [d43aaec4277db] daemon started
06:14:29 
06:14:29 waited for 1.761984375s (out of 30s)
06:14:29 waited for 12.07028ms (out of 30s)
06:14:29 waited for 283.2214ms (out of 30s)
06:14:29 waited for 3.200374647s (out of 30s)
06:14:29 waited for 9.169988ms (out of 30s)
06:14:29 docker_api_swarm_node_test.go:104:
06:14:29     waitAndAssert(c, defaultReconciliationTimeout*2, reducedCheck(sumAsIntegers, d1.CheckActiveContainerCount, d2.CheckActiveContainerCount), checker.Equals, instances)
06:14:29 docker_utils_test.go:441:
06:14:29     c.Assert(v, checker, args...)
06:14:29 ... obtained int = 2
06:14:29 ... expected int = 1
06:14:29 ... output: "cda633412db1\n72aedc916530\n"
06:14:29 
06:14:29 waited for 1m0.108406781s (out of 1m0s)
06:14:29 [d9256ae006cfc] exiting daemon
06:14:29 [d43aaec4277db] exiting daemon

but it looks to be quite a rare error.

@andrewhsu
Member

The flaky test was renamed from TestApiSwarmNodeDrainPause to TestAPISwarmNodeDrainPause in PR #26967 (commit 7fb7a47).

Flakiness still looks like an issue: https://github.com/moby/moby/search?q=TestAPISwarmNodeDrainPause&type=Issues

tiborvass changed the title from "Flaky test: TestApiSwarmNodeDrainPause" to "Flaky test: TestAPISwarmNodeDrainPause" on Jun 28, 2019
@thaJeztah
Member Author

Looks to be caused by networking; see #38020 (comment):

memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use

This is the same issue as #34051 (comment) (DockerSwarmSuite.TestSwarmLockUnlockCluster):

11:18:38 FAIL: docker_api_swarm_node_test.go:75: DockerSwarmSuite.TestAPISwarmNodeDrainPause
11:18:38 
11:18:38 Creating a new daemon
11:18:38 [dc139d4dd25da] waiting for daemon to start
11:18:38 [dc139d4dd25da] waiting for daemon to start
11:18:38 [dc139d4dd25da] daemon started
11:18:38 
11:18:38 Creating a new daemon
11:18:38 [d53117bd699ed] waiting for daemon to start
11:18:38 [d53117bd699ed] waiting for daemon to start
11:18:38 [d53117bd699ed] daemon started
11:18:38 
11:18:38 waited for 707.626002ms (out of 30s)
11:18:38 waited for 9.542429ms (out of 30s)
11:18:38 waited for 18.527488ms (out of 30s)
11:18:38 waited for 2.227709728s (out of 30s)
11:18:38 waited for 8.748001ms (out of 30s)
11:18:38 docker_api_swarm_node_test.go:103:
11:18:38     waitAndAssert(c, defaultReconciliationTimeout*2, reducedCheck(sumAsIntegers, d1.CheckActiveContainerCount, d2.CheckActiveContainerCount), checker.Equals, instances)
11:18:38 docker_utils_test.go:435:
11:18:38     c.Assert(v, checker, args...)
11:18:38 ... obtained int = 2
11:18:38 ... expected int = 1
11:18:38 ... output: "801028f0f936\ne5772591d4b2\n"
11:18:38 
11:18:38 waited for 1m0.063567148s (out of 1m0s)
11:18:38 [dc139d4dd25da] Stopping daemon
11:18:38 [dc139d4dd25da] exiting daemon
11:18:38 [dc139d4dd25da] Daemon stopped
11:18:38 [d53117bd699ed] Stopping daemon
11:18:38 [d53117bd699ed] exiting daemon
11:18:38 [d53117bd699ed] Daemon stopped

Daemon logs attached:

d53117bd699ed.log

time="2019-07-14T11:17:29.017856674Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-14T11:17:29.017973645Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-07-14T11:18:33.144068234Z" level=error msg="agent: session failed" backoff=100ms error=EOF module=node/agent node.id=vtpt9qjni2iiv4gkef4rl7c0f
time="2019-07-14T11:18:33.230004641Z" level=error msg="agent: session failed" backoff=300ms error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused\"" module=node/agent node.id=vtpt9qjni2iiv4gkef4rl7c0f
time="2019-07-14T11:18:33.429552477Z" level=error msg="agent: session failed" backoff=700ms error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused\"" module=node/agent node.id=vtpt9qjni2iiv4gkef4rl7c0f
time="2019-07-14T11:18:33.697972438Z" level=error msg="agent: session failed" backoff=1.5s error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused\"" module=node/agent node.id=vtpt9qjni2iiv4gkef4rl7c0f

dc139d4dd25da.log

time="2019-07-14T11:17:28.210168437Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh1" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh1: no such file or directory"
time="2019-07-14T11:17:28.210183375Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh2" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh2: no such file or directory"
time="2019-07-14T11:17:28.210195867Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh3" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh3: no such file or directory"
time="2019-07-14T11:17:28.297891584Z" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: no such file or directory"
time="2019-07-14T11:17:33.046943421Z" level=error msg="failed to update task to REMOVE" error="update out of sequence" module=node node.id=owzw1jvmh1ovxpmpdut7nzyh5
time="2019-07-14T11:18:33.143549161Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=owzw1jvmh1ovxpmpdut7nzyh5 node.session=op23nfmjvobnaek7ds3yo5o74
time="2019-07-14T11:18:33.143688555Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=vtpt9qjni2iiv4gkef4rl7c0f node.session=13w16v5a0m1wtoni0jopeebf4
time="2019-07-14T11:18:33.144002407Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Unknown desc = context canceled"
time="2019-07-14T11:18:33.945690551Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=dc139d4dd25da

@thaJeztah
Member Author

thaJeztah commented Aug 6, 2019

Failing again on #39678 on PowerPC: https://ci.docker.com/public/job/moby/job/PR-39678/6/execution/node/208/log/

[2019-08-06T15:03:11.839Z] ----------------------------------------------------------------------
[2019-08-06T15:03:11.839Z] FAIL: docker_api_swarm_node_test.go:75: DockerSwarmSuite.TestAPISwarmNodeDrainPause
[2019-08-06T15:03:11.839Z] 
[2019-08-06T15:03:11.839Z] Creating a new daemon
[2019-08-06T15:03:11.839Z] [d1c7657d43a9c] waiting for daemon to start
[2019-08-06T15:03:11.839Z] [d1c7657d43a9c] waiting for daemon to start
[2019-08-06T15:03:11.839Z] [d1c7657d43a9c] daemon started
[2019-08-06T15:03:11.839Z] 
[2019-08-06T15:03:11.839Z] Creating a new daemon
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] waiting for daemon to start
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] waiting for daemon to start
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] daemon started
[2019-08-06T15:03:11.839Z] 
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] joining swarm manager [d1c7657d43a9c]@0.0.0.0:2477, swarm listen addr 0.0.0.0:2478
[2019-08-06T15:03:11.839Z] waited for 975.943123ms (out of 30s)
[2019-08-06T15:03:11.839Z] waited for 22.861292ms (out of 30s)
[2019-08-06T15:03:11.839Z] waited for 44.159303ms (out of 30s)
[2019-08-06T15:03:11.839Z] waited for 2.011645911s (out of 30s)
[2019-08-06T15:03:11.839Z] waited for 20.676265ms (out of 30s)
[2019-08-06T15:03:11.839Z] docker_api_swarm_node_test.go:103:
[2019-08-06T15:03:11.839Z]     waitAndAssert(c, defaultReconciliationTimeout*2, reducedCheck(sumAsIntegers, d1.CheckActiveContainerCount, d2.CheckActiveContainerCount), checker.Equals, instances)
[2019-08-06T15:03:11.839Z] docker_utils_test.go:435:
[2019-08-06T15:03:11.839Z]     c.Assert(v, checker, args...)
[2019-08-06T15:03:11.839Z] ... obtained int = 2
[2019-08-06T15:03:11.839Z] ... expected int = 1
[2019-08-06T15:03:11.839Z] ... output: "a7443fddaa57\nb98f0ab1bd58\n"
[2019-08-06T15:03:11.839Z] 
[2019-08-06T15:03:11.839Z] waited for 1m0.10053926s (out of 1m0s)
[2019-08-06T15:03:11.839Z] [d1c7657d43a9c] Stopping daemon
[2019-08-06T15:03:11.839Z] [d1c7657d43a9c] exiting daemon
[2019-08-06T15:03:11.839Z] [d1c7657d43a9c] Daemon stopped
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] Stopping daemon
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] exiting daemon
[2019-08-06T15:03:11.839Z] [dac8441f4d97e] Daemon stopped

errors from daemon d1c7657d43a9c
d1c7657d43a9c.log

time="2019-08-06T15:01:57.208530017Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh1" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh1: no such file or directory"
time="2019-08-06T15:01:57.208575881Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh2" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh2: no such file or directory"
time="2019-08-06T15:01:57.208620938Z" level=error msg="error reading the kernel parameter net.ipv4.neigh.default.gc_thresh3" error="open /proc/sys/net/ipv4/neigh/default/gc_thresh3: no such file or directory"
time="2019-08-06T15:01:57.462961979Z" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: no such file or directory"
time="2019-08-06T15:02:02.528829491Z" level=error msg="failed to update task to REMOVE" error="update out of sequence" module=node node.id=b0o1zsh7bniu9w78ydnxmj8rb
time="2019-08-06T15:03:02.678270181Z" level=error msg="agent: session failed" backoff=100ms error="context canceled" module=node/agent node.id=b0o1zsh7bniu9w78ydnxmj8rb
time="2019-08-06T15:03:02.679264948Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=bse3kxshxi2kuybpazk1y0d7q node.session=qyhbi07hd3gy640mxp01s9cl2
time="2019-08-06T15:03:02.681556067Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Unknown desc = context canceled"
time="2019-08-06T15:03:03.469333232Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=d1c7657d43a9c

errors from daemon dac8441f4d97e
dac8441f4d97e.log

time="2019-08-06T15:01:58.412413449Z" level=error msg="error in agentInit: failed to create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 7946: listen tcp 0.0.0.0:7946: bind: address already in use"
time="2019-08-06T15:03:02.679524911Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Unknown desc = context canceled" module=node/agent node.id=bse3kxshxi2kuybpazk1y0d7q
time="2019-08-06T15:03:02.707003304Z" level=error msg="agent: session failed" backoff=300ms error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused\"" module=node/agent node.id=bse3kxshxi2kuybpazk1y0d7q
time="2019-08-06T15:03:02.969599063Z" level=error msg="agent: session failed" backoff=700ms error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2477: connect: connection refused\"" module=node/agent node.id=bse3kxshxi2kuybpazk1y0d7q
time="2019-08-06T15:03:08.500794580Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=dac8441f4d97e
