teamcity: failed test: bank/node-restart #30064

Closed
cockroach-teamcity opened this issue Sep 11, 2018 · 111 comments · Fixed by #31408
Comments

@cockroach-teamcity
Member

The following tests appear to have failed:

#897838:

--- FAIL: roachtest/acceptance/bank/node-restart (32.231s)
	test.go:503,bank.go:291,bank.go:398,acceptance.go:59: stall detected at round 1, no forward progress for 30s


Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Sep 11, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 11, 2018
@petermattis
Collaborator

#29678 also has a failure for this test, though a different failure mode.

@petermattis petermattis self-assigned this Sep 12, 2018
@petermattis petermattis modified the milestones: 2.2, 2.1 Sep 12, 2018
@petermattis
Collaborator

I'm unable to reproduce this failure locally. Seems to have happened several times on CI.

@petermattis
Collaborator

It's interesting that the stall occurs on round 1. Looks like the cluster never came up. The logs on the nodes don't show anything interesting:

I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] file created at: 2018/09/11 14:27:07
I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] running on machine: c688e08a1da4
I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] binary: CockroachDB CCL v2.2.0-alpha.00000000-670-g83a7cb5 (x86_64-pc-linux-gnu, built 2018/09/11 14:25:15, go1.10.3)
I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] arguments: [/home/roach/local/1/cockroach start --insecure --store=path=/home/roach/local/1/data --log-dir=/home/roach/local/1/logs --background --cache=6% --max-sql-memory=6% --port=26257 --http-port=26258 --locality=region=local,zone=local --host=127.0.0.1]
I180911 14:27:07.082225 1 util/log/clog.go:1151  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓

@petermattis
Collaborator

Here's something odd. The test indicated that it started the cluster, then stopped node 3 and restarted it. Yet I see 2 log files on nodes 1, 2, and 4, while I expect to see only 1. A simple manual test of this scenario shows only 1 log file produced on the nodes that were started once.

Oh, it looks like one of the log files is from the invocation of cockroach with the --background flag. That doesn't seem to reproduce on Mac OS X. Perhaps a Linux-specific bug. Unlikely to be the cause of the stall.

@tbg tbg changed the title teamcity: failed tests on master: roachtest/acceptance teamcity: failed tests: cluster-recovery Sep 25, 2018
@tbg tbg changed the title teamcity: failed tests: cluster-recovery teamcity: failed tests: node-restart Sep 25, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Sep 25, 2018
@petermattis
Collaborator

While investigating another acceptance failure, I reproduced the problem seen in the original message:

--- FAIL: acceptance/bank/node-restart (47.53s)
        test.go:503,bank.go:291,bank.go:402,acceptance.go:60: stall detected at round 6, no forward progress for 30s
> /usr/local/bin/roachprod destroy local
local: stopping
local: wiping
client 1 shutting down

Unfortunately, I didn't specify the --debug flag, so the cluster was wiped. The reproduction occurred on a 24-CPU GCE worker machine running roachtest -l --count=100 acceptance/bank.

@tbg tbg changed the title teamcity: failed tests: node-restart teamcity: failed test: bank/node-restart Sep 26, 2018
@tbg
Member

tbg commented Sep 27, 2018

Starting to repro this too.

--- FAIL: acceptance/bank/node-restart (32.14s)
	test.go:503,bank.go:291,bank.go:402,acceptance.go:60: stall detected at round 1, no forward progress for 30s

Don't think a single write to the DB has happened (the db is running happily and has a fresh accounts table). Going to add some stack traces and retry.
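For illustration, a rough sketch of one way to do that: dump all goroutine stacks from the test process when a round stops making progress. This is not the actual roachtest change, and the progress closure is a placeholder:

// Illustrative sketch only (not the actual roachtest diff): dump all goroutine
// stacks from the test binary when no forward progress is made within the
// stall deadline. progress is a hypothetical closure returning the number of
// completed transfers.
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func dumpStacksIfStalled(progress func() int, timeout time.Duration) {
	before := progress()
	time.Sleep(timeout)
	if progress() == before {
		fmt.Fprintln(os.Stderr, "stall detected, dumping goroutine stacks:")
		// debug=2 prints stacks in the same panic-style format as the traces
		// pasted later in this thread.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}
}

func main() {
	transfers := 0 // stands in for the test's per-client transfer counter
	dumpStacksIfStalled(func() int { return transfers }, 30*time.Second)
}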

@tbg
Member

tbg commented Sep 27, 2018

Hmm, looking at a stuck run in round 18.

The cluster is happy, and the accounts table has only zeroes as well. Wouldn't be surprised if the test is doing something silly.

@tbg
Member

tbg commented Sep 27, 2018

I CTRL+C'ed a test and can't be sure whether this is why it happened, but I got stuck with this stack trace:

goroutine 59 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7fc2683adc90, 0x77, 0xc42062eea0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42068d518, 0x77, 0xc42062ee00, 0xc42062eef0, 0xc42062ee01)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitWrite(0xc42068d518, 0xc42068d500, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:94 +0x3d
internal/poll.(*FD).WaitWrite(0xc42068d500, 0xc420042070, 0xc420042070)
	/usr/local/go/src/internal/poll/fd_unix.go:440 +0x37
net.(*netFD).connect(0xc42068d500, 0x1b95b80, 0xc420042070, 0x0, 0x0, 0x1b7f8c0, 0xc4202f7880, 0x0, 0x0, 0x0, ...)
	/usr/local/go/src/net/fd_unix.go:152 +0x299
net.(*netFD).dial(0xc42068d500, 0x1b95b80, 0xc420042070, 0x1b9d080, 0x0, 0x1b9d080, 0xc4206b8270, 0xc42062f0f0, 0x75d83e)
	/usr/local/go/src/net/sock_posix.go:142 +0xe9
net.socket(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x2, 0x1, 0x0, 0x0, 0x1b9d080, 0x0, ...)
	/usr/local/go/src/net/sock_posix.go:93 +0x1a6
net.internetSocket(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x1b9d080, 0x0, 0x1b9d080, 0xc4206b8270, 0x1, 0x0, ...)
	/usr/local/go/src/net/ipsock_posix.go:141 +0x129
net.doDialTCP(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x0, 0xc4206b8270, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/tcpsock_posix.go:62 +0xb9
net.dialTCP(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x0, 0xc4206b8270, 0x1d6c7b5dbbd31, 0x2008ed4, 0x2008ed401b95b80)
	/usr/local/go/src/net/tcpsock_posix.go:58 +0xe4
net.dialSingle(0x1b95b80, 0xc420042070, 0xc42068d480, 0x1b83ec0, 0xc4206b8270, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/dial.go:547 +0x375
net.dialSerial(0x1b95b80, 0xc420042070, 0xc42068d480, 0xc4202bc360, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/dial.go:515 +0x22d
net.(*Dialer).DialContext(0xc42062f630, 0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0xc4202e8940, 0xf, 0x0, 0x0, 0x0, ...)
	/usr/local/go/src/net/dial.go:397 +0x678
net.(*Dialer).Dial(0xc42062f630, 0x19adb98, 0x3, 0xc4202e8940, 0xf, 0x5ff7a7, 0x0, 0xc42062f660, 0x3)
	/usr/local/go/src/net/dial.go:320 +0x75
net.Dial(0x19adb98, 0x3, 0xc4202e8940, 0xf, 0xbeb668, 0x5c5920, 0x19bd16a, 0x19bd197)
	/usr/local/go/src/net/dial.go:291 +0x99
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.defaultDialer.Dial(0x19adb98, 0x3, 0xc4202e8940, 0xf, 0x182d8c0, 0xc4206247e0, 0xc420624838, 0x10)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:95 +0x49
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.dial(0x1b85040, 0x28808a8, 0xc4206b8150, 0x0, 0xc4206248a8, 0x1, 0x2)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:379 +0x35e
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.DialOpen(0x1b85040, 0x28808a8, 0xc4204ca1c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:338 +0x553
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.Open(0xc4204ca1c1, 0x2f, 0xc42062fa60, 0x5fc165, 0xc420827500, 0xc420646000)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:251 +0x4d
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*Driver).Open(0x28808a8, 0xc4204ca1c1, 0x2f, 0x9325bd, 0xc40000003a, 0xc4206b80c0, 0x86)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:45 +0x35
database/sql.dsnConnector.Connect(0xc4204ca1c1, 0x2f, 0x1b7e8a0, 0x28808a8, 0x1b95b80, 0xc420042070, 0x0, 0x0, 0xc420545f00, 0x7fc268400000)
	/usr/local/go/src/database/sql/sql.go:600 +0x45
database/sql.(*DB).conn(0xc4204ba8c0, 0x1b95b80, 0xc420042070, 0xc42062fd01, 0x94595c, 0xc4204bb540, 0xc4206246c0)
	/usr/local/go/src/database/sql/sql.go:1103 +0x131
database/sql.(*DB).exec(0xc4204ba8c0, 0x1b95b80, 0xc420042070, 0x1a1cf11, 0xb5, 0xc42062fee8, 0x3, 0x3, 0x4db34b9682630c01, 0xc42062fdb8, ...)
	/usr/local/go/src/database/sql/sql.go:1367 +0x66
database/sql.(*DB).ExecContext(0xc4204ba8c0, 0x1b95b80, 0xc420042070, 0x1a1cf11, 0xb5, 0xc42062fee8, 0x3, 0x3, 0x10, 0x1756f20, ...)
	/usr/local/go/src/database/sql/sql.go:1349 +0xe1
database/sql.(*DB).Exec(0xc4204ba8c0, 0x1a1cf11, 0xb5, 0xc42062fee8, 0x3, 0x3, 0xc42062fed0, 0x6b3676, 0xbee34825c2005275, 0xb6362116)
	/usr/local/go/src/database/sql/sql.go:1363 +0x85
main.(*bankClient).transferMoney(0xc420447950, 0x3e7, 0x3e7, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:63 +0x206
main.(*bankState).transferMoney(0xc42042cd70, 0x1b95b40, 0xc4204c2640, 0xc4205ee0c0, 0x1, 0x3e7, 0x3e7)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:152 +0xb6
created by main.runBankNodeRestart
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:390 +0x3af

@tbg
Member

tbg commented Sep 27, 2018

Hrm, this test is a bit nonsensical; look at this query:

UPDATE bank.accounts
   SET balance = CASE id WHEN $1 THEN balance-$3 WHEN $2 THEN balance+$3 END
 WHERE id IN ($1, $2) AND (SELECT balance >= $3 FROM bank.accounts WHERE id = $1)

This is a no-op if the withdrawal account $1 doesn't have a balance of at least $3, but how would it? We never initialize any account with a nonzero balance in this test.

@petermattis
Collaborator

this test is a bit nonsensical

Did I cut&paste something incorrectly from the old acceptance test? Doesn't look like it.

@tbg
Member

tbg commented Sep 27, 2018

Got the stall error too, on the first round. The cluster is healthy and happy to respond to queries.

=== RUN   acceptance/bank/node-restart
> /home/tschottdorf/go/bin/roachprod wipe local
local: stopping
local: wiping
> /home/tschottdorf/go/bin/roachprod put local /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/cockroach ./cockroach
local: putting /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/cockroach ./cockroach

   1: done
   2: done
   3: done
   4: done
> /home/tschottdorf/go/bin/roachprod start local
local: starting
local: initializing cluster settings
SET CLUSTER SETTING
monkey starts (seed -3775595060315039375)
round 1: restarting nodes [2]
round 1: restarting 2
> /home/tschottdorf/go/bin/roachprod stop local:2
local: stopping
> /home/tschottdorf/go/bin/roachprod start local:2
local: starting
round 1: monkey sleeping while cluster recovers...
round 1: client counts: (0)
goroutine 31 [running]:
runtime/debug.Stack(0x6b3676, 0xbee34ddc05743323, 0x771c1682b)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x22
main.(*bankState).waitClientsStop(0xc42042c0f0, 0x1b95e00, 0xc42002a200, 0xc42038fe40, 0xc4208d20c0, 0x6fc23ac00)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:299 +0x76e
main.runBankNodeRestart(0x1b95e00, 0xc42002a200, 0xc42038fe40, 0xc4208d20c0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:418 +0x662
main.registerAcceptance.func1(0x1b95e00, 0xc42002a200, 0xc42038fe40, 0xc4208d20c0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:60 +0x8e
main.(*registry).run.func2(0x1a758b0, 0xc42038fe40, 0xc420292780, 0xc4204e43e0, 0xc42000e098, 0x1b95e00, 0xc420661300, 0xc420823400)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:792 +0x458
created by main.(*registry).run
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:638 +0x2ed
round 1: not waiting for recovery due to signal that we're done
> /home/tschottdorf/go/bin/roachprod get local logs artifacts/acceptance/bank/node-restart/logs
local: getting logs artifacts/acceptance/bank/node-restart/logs

   1: done
   2: done
   3: done
   4: done
--- FAIL: acceptance/bank/node-restart (32.18s)
	test.go:503,bank.go:300,bank.go:418,acceptance.go:60: stall detected at round 1, no forward progress for 30s

@tbg
Member

tbg commented Sep 27, 2018

Did I cut&paste something incorrectly from the old acceptance test? Doesn't look like it.

Might've been nonsensical for a long time. We just need to initialize the accounts with, say, 100 instead of a zero balance and it'll actually do something. But first let's deflake it...
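For illustration, a rough sketch of seeding the accounts with a nonzero balance. Table and column names follow the query quoted above; the connection string and helper names are placeholders, and the real bank.go initialization may differ:

// Rough sketch only: seed every account with a nonzero starting balance so the
// conditional UPDATE quoted above can actually move money.
package main

import (
	gosql "database/sql"

	_ "github.com/lib/pq" // Postgres-wire driver used to talk to CockroachDB
)

func seedAccounts(db *gosql.DB, numAccounts, initialBalance int) error {
	if _, err := db.Exec(`CREATE DATABASE IF NOT EXISTS bank`); err != nil {
		return err
	}
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS bank.accounts (id INT PRIMARY KEY, balance INT NOT NULL)`,
	); err != nil {
		return err
	}
	for id := 0; id < numAccounts; id++ {
		// UPSERT keeps re-runs idempotent; each account starts with initialBalance.
		if _, err := db.Exec(
			`UPSERT INTO bank.accounts (id, balance) VALUES ($1, $2)`,
			id, initialBalance,
		); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Placeholder connection string matching the insecure local cluster from the logs.
	db, err := gosql.Open("postgres", "postgresql://root@127.0.0.1:26257/?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err := seedAccounts(db, 1000, 100); err != nil {
		panic(err)
	}
}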

@tbg
Member

tbg commented Sep 27, 2018

The "stall detected" message comes with this goroutine stack (the other goroutines are doing harmless stuff):

goroutine 47 [IO wait]:
internal/poll.runtime_pollWait(0x7f7105916bc0, 0x72, 0xc420233388)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4205d4d18, 0x72, 0xffffffffffffff00, 0x1b81060, 0x2607d20)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4205d4d18, 0xc420575000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4205d4d00, 0xc420575000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4205d4d00, 0xc420575000, 0x1000, 0x1000, 0xc42005e570, 0xc42005e500, 0xc4208ca098)
	/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc42033e008, 0xc420575000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6a
bufio.(*Reader).Read(0xc4202c2180, 0xc4201302e0, 0x5, 0x200, 0x50, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0x1b7c2c0, 0xc4202c2180, 0xc4201302e0, 0x5, 0x200, 0x5, 0x8, 0x18, 0xc4202943e0)
	/usr/local/go/src/io/io.go:309 +0x86
io.ReadFull(0x1b7c2c0, 0xc4202c2180, 0xc4201302e0, 0x5, 0x200, 0x1ff, 0x50, 0xc420233620)
	/usr/local/go/src/io/io.go:327 +0x58
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recvMessage(0xc4201302c0, 0xc4202943e0, 0x1ff, 0x50, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:947 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recv(0xc4201302c0, 0xc420233768, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:977 +0x7c
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).startup(0xc4201302c0, 0xc4204c25d0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1099 +0x962
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.DialOpen(0x1b85400, 0x28808a8, 0xc42043c3c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:344 +0x646
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.Open(0xc42043c3c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:251 +0x4d
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*Driver).Open(0x28808a8, 0xc42043c3c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:45 +0x35
database/sql.dsnConnector.Connect(0xc42043c3c1, 0x2f, 0x1b7ec60, 0x28808a8, 0x1b95f40, 0xc420042070, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:600 +0x45
database/sql.(*DB).conn(0xc4203d8820, 0x1b95f40, 0xc420042070, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:1103 +0x131
database/sql.(*DB).exec(0xc4203d8820, 0x1b95f40, 0xc420042070, 0x1a1d299, 0xb5, 0xc420233ee8, 0x3, 0x3, 0x7f7105974f01, 0xc42047d505, ...)
	/usr/local/go/src/database/sql/sql.go:1367 +0x66
database/sql.(*DB).ExecContext(0xc4203d8820, 0x1b95f40, 0xc420042070, 0x1a1d299, 0xb5, 0xc420233ee8, 0x3, 0x3, 0x10, 0x17572a0, ...)
	/usr/local/go/src/database/sql/sql.go:1349 +0xe1
database/sql.(*DB).Exec(0xc4203d8820, 0x1a1d299, 0xb5, 0xc42047d6e8, 0x3, 0x3, 0x3a98cde8, 0xc42047d6d0, 0x6b3676, 0xbee34edc50d05a97)
	/usr/local/go/src/database/sql/sql.go:1363 +0x85
main.(*bankClient).transferMoney(0xc4204c3590, 0x3e7, 0x3e7, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:65 +0x212
main.(*bankState).transferMoney(0xc4204c0cd0, 0x1b95f00, 0xc4204be580, 0xc4204da0c0, 0x1, 0x3e7, 0x3e7)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:159 +0xba
created by main.runBankNodeRestart
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:410 +0x48a

The server is happy. I'm afraid this means that what I want is a stack trace of the node the client is talking to. Let's see.
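For illustration, a rough sketch of grabbing that server-side stack trace. It assumes the node's HTTP port serves the standard Go net/http/pprof endpoints under /debug/pprof/ (CockroachDB's debug pages expose these); 26258 is taken from the --http-port flag in the logs above:

// Illustrative sketch: pull a full goroutine dump from the node the client is
// talking to, via the standard net/http/pprof goroutine profile.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func dumpServerStacks(httpAddr string) error {
	// debug=2 returns panic-style stacks, matching the traces in this thread.
	resp, err := http.Get(fmt.Sprintf("http://%s/debug/pprof/goroutine?debug=2", httpAddr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	if err := dumpServerStacks("127.0.0.1:26258"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}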

@tbg
Member

tbg commented Sep 27, 2018

Hmm. I had what I thought was a failing test run:

test finished, waiting for shutdown of 4 clients
round 26: not waiting for recovery due to signal that we're done
[   2] acceptance/bank/node-restart: ??? (0s)
[   3] acceptance/bank/node-restart: ??? (0s)
[   4] acceptance/bank/node-restart: ??? (0s)
[   5] acceptance/bank/node-restart: ??? (0s)
client 1 shutting down
11666 transfers (36.4/sec) in 320.4s

The client 1 shutting down line coincided exactly with my running select * from bank.accounts where balance = 0; against the cluster. Maybe a coincidence. I'll do the same next time.

@tbg
Member

tbg commented Sep 27, 2018

Yeah - the same thing happened again. client 1 shutting down was printed precisely as I queried the table.

round 14: monkey sleeping while cluster recovers...
round 14: client counts: (8977)
[   1] acceptance/bank/node-restart: ??? (0s)
round 14: not waiting for recovery due to signal that we're done
[   2] acceptance/bank/node-restart: ??? (0s)
client 1 shutting down
8978 transfers (55.3/sec) in 162.3s
> /home/tschottdorf/go/bin/roachprod get local logs artifacts/acceptance/bank/node-restart/logs
local: getting logs artifacts/acceptance/bank/node-restart/logs

   1: done
   2: done
   3: done
   4: done
--- PASS: acceptance/bank/node-restart (164.61s)

@tbg
Member

tbg commented Sep 27, 2018

Next hang. This time, the cluster is not fine (connecting via the SQL shell times out).

[   1] acceptance/bank/node-restart: ??? (0s)
test finished, waiting for shutdown of 4 clients
round 25: not waiting for recovery due to signal that we're done
[   2] acceptance/bank/node-restart: ??? (0s)
[   3] acceptance/bank/node-restart: ??? (0s)
[   4] acceptance/bank/node-restart: ??? (0s)
[   5] acceptance/bank/node-restart: ??? (0s)
[   6] acceptance/bank/node-restart: ??? (0s)
[   7] acceptance/bank/node-restart: ??? (0s)
[   8] acceptance/bank/node-restart: ??? (0s)
[   9] acceptance/bank/node-restart: ??? (0s)

Okay, here's what node 4 (the one the client connects to) says about this connection:

goroutine 547 [IO wait]:
internal/poll.runtime_pollWait(0x7f022e731bc0, 0x72, 0xc429fd1730)
        /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42061c518, 0x72, 0xffffffffffffff00, 0x2f2a900, 0x4142248)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc42061c518, 0xc42526c000, 0x1000, 0x1000)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc42061c500, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc42061c500, 0xc42526c000, 0x1000, 0x1000, 0x0, 0xc429fd1868, 0x82419c)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420842038, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:176 +0x6a
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*MuxConn).Read(0xc4208245a0, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/cmux.go:218 +0x119
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*readTimeoutConn).Read(0xc420389280, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:1372 +0x110
bufio.(*Reader).fill(0xc425232060)
        /usr/local/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).ReadByte(0xc425232060, 0xbee351ab872a5d51, 0x4c92b6824, 0x46c8ca0)
        /usr/local/go/src/bufio/bufio.go:242 +0x39
github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase.(*ReadBuffer).ReadTypedMsg(0xc4252321d0, 0x2f3ab20, 0xc425232060, 0x2f34220, 0x4717650, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase/encoding.go:120 +0x35
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl(0xc425232000, 0x2f51de0, 0xc420a16400, 0xc4202bb6b0, 0xc420890100, 0x5400, 0x15000, 0xc4205d9c70, 0xc420b38090, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:285 +0x5bd
github.com/cockroachdb/cockroach/pkg/sql/pgwire.serveConn(0x2f51de0, 0xc420a16380, 0x2f6cb00, 0xc4208245a0, 0x0, 0x0, 0xc42522e01e, 0x4, 0x0, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:163 +0x207
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).ServeConn(0xc4205d9b00, 0x2f51de0, 0xc420a16380, 0x2f6cb00, 0xc4208245a0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:453 +0x711
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start.func19.1(0x2f6cb00, 0xc4208245a0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1631 +0x180
github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith.func1(0xc420b38090, 0x2f51ea0, 0xc42070c3f0, 0xc420304420, 0x2f6cb00, 0xc4208245a0, 0xc42070c420)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:141 +0xa9
created by github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:138 +0x249

@tbg
Member

tbg commented Sep 27, 2018

Interesting. I can actually connect to the SQL shell and query the table on n4, but not on n1 and n2. Then I tried n1 again and it worked. n2 still doesn't. Hmm. I wonder if this has something to do with it:

$ lsof | grep tschottdorf | wc -l
11562

@tbg
Member

tbg commented Sep 27, 2018

Restricting to network file descriptors, there are only a few dozen, which seems reasonable, so back to the nodes themselves... n1 is clearly unhappy:

goroutine 26 [select, 19 minutes]:
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).lookupRangeDescriptorInternal(0xc420824550, 0x2f51ea0, 0xc420a90180, 0xc420309c50, 0x10, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:356 +0x4f7
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).LookupRangeDescriptor(0xc420824550, 0x2f51ea0, 0xc420a90180, 0xc420309c50, 0x10, 0x10, 0x0, 0x3fc3333333333300, 0xc420b36540, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:241 +0x92
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).getDescriptor(0xc420264600, 0x2f51ea0, 0xc420a90180, 0xc420309c50, 0x10, 0x10, 0x0, 0xc420b36500, 0x2faf080, 0x3b9aca00, ...)

@nvanbenschoten
Member

Do you have evidence that we're sending super-large batches on the Raft transport?

No, and that theory didn't pan out. I still see all followers slowly fall behind the leader under heavy load.

@nvanbenschoten
Member

Everything I've written about so far is concerning and deserves continued investigation. It is pretty easy to reproduce with #31081 if you disable nobarrier and use Jaeger to watch for long requests. However, I think I've convinced myself that it's not the issue behind the stalls we see in this test, because we don't see any of its symptoms in the log. I've opened #31330 to continue tracking it.

To recap, we've fixed three separate stalls so far, and the test passes reliably for about 2-3 hours of stress. When it does fail, it doesn't stall indefinitely as it did with the other issues; instead, it stalls for only about 60 seconds. I suspect that the following log line explains these 60-second stalls. We see these messages on n4, which is serving as the SQL gateway in the test, whenever it fails.

E181012 22:26:54.548553 99 sqlmigrations/migrations.go:370  [n4] failed attempt to acquire migration lease: lease /System/"system-version/lease" is not available until at least 1539383274.491341436,0

The messages continue for exactly a minute before going away. They originate from EnsureMigrations, and they indicate that the server is not yet accepting SQL connections. I suspect that the issue here is that we kill a node which has acquired but not yet released the SQL migration lease. To back this up, I see that n2 was killed immediately after logging

I181012 22:26:54.413824 90 cli/start.go:507  initial startup completed, will now wait for `cockroach init`

We then establish a SQL connection to a node that is stuck trying to acquire this lease and has to wait for the abandoned lease to expire. I don't have a good solution here in the general case, but there are a few things we could do in the test to fix this and prevent the flake:

  1. we could round-robin the SQL connection instead of only using a single node that could get stuck
  2. we could ensure that all nodes can serve SQL connections before beginning the chaos monkey
  3. we could adjust roachprod to wait until server initialization is complete before returning from roachprod start

I think the second option is the way to go, but I want to make sure it's not undermining the purpose of the test. @petermattis could you weigh in on that? Are we trying to test everything from the server startup process onwards while under chaos?

Also for what it's worth, I only noticed this because I searched for ^E in the logs. I don't think I would have caught it if my branch had included #31270. Depending on the outcome of this investigation, we may want to revert that.

@petermattis
Collaborator

Interesting. Is it really an inability to get the migrations lease that is causing these 60s stalls? If we disable that code, do the stalls go away? Why are we even trying to get the migrations lease? It looks to me like EnsureMigrations only tries to get the lease if there are migrations to run. So what migration hasn't been run, given that the cluster was just created and has only ever used a single version of cockroach?

Of the mitigation options, the second sounds the best.

@tbg
Member

tbg commented Oct 13, 2018

I'm also confused why this is happening. Are there really any outstanding migrations? If so, do you see this only just after the cluster got booted up? Something seems fishy here.

@nvanbenschoten
Member

If we disable that code do the stalls go away?

I'm testing that right now. I suspect that they will unless there's another class of stalls.

Why are we even trying to get the migrations lease? Looks to me like EnsureMigrations only tries to get the lease if there are migrations to run. So what migration hasn't been run given that the cluster is created and only ever using a single version of cockroach?

It doesn't look like there is any protection against a fresh cluster running these SQL migrations, and I can certainly see all backwardCompatibleMigrations with a workFn run in a fresh cluster if I drop some logging into that area of the code. The cluster will search for migration keys, find none, and run all available migrations. I suppose they should all be quick because they are mostly no-ops, and it's important that they create migration keys as they complete. @benesch do you know whether this was intentional, an oversight, or somewhere in between?

@benesch
Contributor

benesch commented Oct 14, 2018

It doesn't look like there is any protection against a fresh cluster running these SQL migrations, and I can certainly see all backwardCompatibleMigrations with a workFn run in a fresh cluster if I drop some logging into that area of the code. The cluster will search for migration keys, find none, and run all available migrations. I suppose they should all be quick because they are mostly no-ops, and it's important that they create migration keys as they complete. @benesch do you know whether this was intentional, an oversight, or somewhere in between?

It's intentional. I've never liked it much, but we use migrations as a convenient place to run cluster bootstrapping code (i.e., code that must be run at least once when a cluster is first created). For example:

{
	// Introduced in v1.1. Permanent migration.
	name:   "populate initial version cluster setting table entry",
	workFn: populateVersionSetting,
},

(which I think Tobi added actually 😀)

@benesch
Contributor

benesch commented Oct 14, 2018

Also, even without these "permanent" migrations, fresh clusters still tend to have migrations:

// Adding a new system table? Don't add it to the metadata schema yet!
//
// The first release to contain the system table must add the system table
// via a migration in both fresh clusters and existing clusters. Only in the
// next release should you "bake in" the migration by adding the table to
// the metadata schema here. This ensures there's only ever one code path
// responsible for creating the table.

@tbg
Member

tbg commented Oct 15, 2018

Somehow I thought you said that a lease was acquired every time a node started, but I just verified (and it seems you're all implying this) that it only happens once at the beginning of a new cluster (assuming all the migrations get applied before something gets killed). I think it's fine if we just work around that then. Should be enough to SELECT 1 in the monkey before going for the kill, right?

@nvanbenschoten
Member

Should be enough to SELECT 1 in the monkey before going for the kill, right?

Yeah exactly, and in doing so I've only had a single stall (which was persistent) in the last 36 hours of stressing the test. I'll make that change, fix the proposal forwarding issues now that etcd-io/etcd#10167 is merged, and rebase the test on master to continue the hunt. It's possible we've already fixed this stall, and if not, it appears to be exceedingly rare.

@petermattis
Collaborator

Should be enough to SELECT 1 in the monkey before going for the kill, right?

I think it would be sufficient to verify all nodes can serve a SELECT 1 before starting up the test proper. This doesn't have to be done before every node kill, though I suppose that doesn't hurt.

@nvanbenschoten
Member

I was doing it at the beginning of bankState.chaosMonkey, which has the effect you're describing.
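For illustration, a rough sketch of that readiness check; the function and variable names are placeholders rather than the actual bank.go code:

// Placeholder sketch of the fix being discussed: before the chaos monkey kills
// anything, make sure every node answers a trivial query, so the test never
// hangs on a node that is still blocked on startup work such as an abandoned
// SQL migration lease.
package bank

import (
	"context"
	gosql "database/sql"
	"fmt"
	"time"
)

func waitForSQLReady(ctx context.Context, dbs []*gosql.DB, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for i, db := range dbs {
		for {
			if _, err := db.ExecContext(ctx, `SELECT 1`); err == nil {
				break // node i+1 is serving SQL; move on to the next one
			} else if time.Now().After(deadline) {
				return fmt.Errorf("node %d still not serving SQL after %s: %v", i+1, timeout, err)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}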

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 15, 2018
Informs cockroachdb#30064.

This prevents us from running into issues with SQL migrations.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 15, 2018
Informs cockroachdb#30064.

This prevents us from running into issues with SQL migrations.

Release note: None
craig bot pushed a commit that referenced this issue Oct 15, 2018
30989: roachtest/bank: don't start chaos until nodes are serving SQL r=nvanbenschoten a=nvanbenschoten

Informs #30064.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this issue Oct 17, 2018
31408: storage: replace remote proposal tracking with uncommitted log size protection r=nvanbenschoten a=nvanbenschoten

Closes #30064.

This change reverts most of the non-testing code from 03b116f and f2f3fd2
and replaces it with use of the MaxUncommittedEntriesSize config. This
configuration was added in etcd-io/etcd#10167 and provides protection against
unbounded Raft log growth when a Raft group stops being able to commit
entries. It makes proposals into Raft safer because proposers don't need
to verify before the fact that the proposal isn't a duplicate that might
be blowing up the size of the Raft group.

By default, the configuration is set to double the Replica's proposal quota.
The logic here is that the quotaPool should be responsible for throttling
proposals in all cases except for unbounded Raft re-proposals because it
queues efficiently instead of dropping proposals on the floor indiscriminately.

Release note (bug fix): Fix a bug where Raft proposals could get
stuck if forwarded to a leader who could not itself append a new
entry to its log.

This will be backported, but not to 2.1.0. The plan is to get it into 2.1.1.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
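For context, a rough sketch of the upstream etcd/raft knob this relies on; it illustrates the raft.Config field added in etcd-io/etcd#10167, not CockroachDB's actual wiring, and the import path shown is the modern one:

// Sketch only: where MaxUncommittedEntriesSize plugs into an etcd/raft config.
// quotaBytes stands in for the Replica's proposal quota; doubling it mirrors
// the default described in the commit message above.
package raftcfg

import "go.etcd.io/etcd/raft"

func newRaftConfig(id uint64, storage raft.Storage, quotaBytes uint64) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   16 << 10,
		MaxInflightMsgs: 64,
		// Cap the total byte size of uncommitted entries a leader will accept;
		// once exceeded, new proposals are rejected with ErrProposalDropped
		// instead of growing the Raft log without bound. Zero means no limit.
		MaxUncommittedEntriesSize: 2 * quotaBytes,
	}
}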
@craig craig bot closed this as completed in #31408 Oct 17, 2018
@nvanbenschoten
Member

Yeah exactly, and in doing so I've only had a single stall (which was persistent) in the last 36 hours of stressing the test.

I'm not confident I actually had all of the fix from #31408 in place when I hit this stall, and after ensuring that, I haven't been able to trigger a stall again in more than 24 hours. I'm closing this for now but will keep stressing it for the next few days. We'll see if it ever comes up again.

@nvanbenschoten
Member

4 days of continuous stress without a failure. I'm declaring this stable.

@petermattis
Collaborator

4 days of continuous stress without a failure. I'm declaring this stable.

🎉 🎉 🎉

tbg added a commit to tbg/cockroach that referenced this issue Jan 10, 2019
This was forgotten, cockroachdb#30064 is long fixed :fingers_crossed:

Release note: None
craig bot pushed a commit that referenced this issue Jan 11, 2019
33549: acceptance: re-enable bank/zerosum-restart r=petermattis a=tbg

This was forgotten, #30064 is long fixed :fingers_crossed:

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>