teamcity: failed test: bank/node-restart #30064

Closed
cockroach-teamcity opened this issue Sep 11, 2018 · 111 comments · Fixed by #31408
Comments

@cockroach-teamcity
Member

The following tests appear to have failed:

#897838:

--- FAIL: roachtest/acceptance/bank/node-restart (32.231s)
	test.go:503,bank.go:291,bank.go:398,acceptance.go:59: stall detected at round 1, no forward progress for 30s


Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Sep 11, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 11, 2018
@petermattis
Collaborator

#29678 also has a failure for this test, though a different failure mode.

@petermattis petermattis self-assigned this Sep 12, 2018
@petermattis petermattis modified the milestones: 2.2, 2.1 Sep 12, 2018
@petermattis
Collaborator

I'm unable to reproduce this failure locally. Seems to have happened several times on CI.

@petermattis
Collaborator

It's interesting that the stall occurs on round 1. Looks like the cluster never came up. The logs on the nodes don't show anything interesting:

I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] file created at: 2018/09/11 14:27:07
I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] running on machine: c688e08a1da4
I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] binary: CockroachDB CCL v2.2.0-alpha.00000000-670-g83a7cb5 (x86_64-pc-linux-gnu, built 2018/09/11 14:25:15, go1.10.3)
I180911 14:27:07.082225 1 util/log/clog.go:1151  [config] arguments: [/home/roach/local/1/cockroach start --insecure --store=path=/home/roach/local/1/data --log-dir=/home/roach/local/1/logs --background --cache=6% --max-sql-memory=6% --port=26257 --http-port=26258 --locality=region=local,zone=local --host=127.0.0.1]
I180911 14:27:07.082225 1 util/log/clog.go:1151  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓

@petermattis
Collaborator

Here's something odd. The test indicated that it started the cluster, then stopped node 3 and restarted it. Yet I see 2 log files on nodes 1, 2, and 4, while I expect to see only 1. A simple manual test of this scenario shows only 1 log file produced on the nodes that were started once.

Oh, it looks like one of the log files is from the invocation of cockroach with the --background flag. That doesn't seem to reproduce on Mac OS X. Perhaps a Linux-specific bug. Unlikely to be the cause of the stall.

@tbg tbg changed the title teamcity: failed tests on master: roachtest/acceptance teamcity: failed tests: cluster-recovery Sep 25, 2018
@tbg tbg changed the title teamcity: failed tests: cluster-recovery teamcity: failed tests: node-restart Sep 25, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Sep 25, 2018
@petermattis
Collaborator

While investigating another acceptance failure, I reproduced the problem seen in the original message:

--- FAIL: acceptance/bank/node-restart (47.53s)
        test.go:503,bank.go:291,bank.go:402,acceptance.go:60: stall detected at round 6, no forward progress for 30s
> /usr/local/bin/roachprod destroy local
local: stopping
local: wiping
client 1 shutting down

Unfortunately, I didn't specify the --debug flag, so the cluster was wiped. The reproduction occurred on a 24-CPU GCE worker machine running roachtest -l --count=100 acceptance/bank.

@tbg tbg changed the title teamcity: failed tests: node-restart teamcity: failed test: bank/node-restart Sep 26, 2018
@tbg
Member

tbg commented Sep 27, 2018

Starting to repro this too.

--- FAIL: acceptance/bank/node-restart (32.14s)
	test.go:503,bank.go:291,bank.go:402,acceptance.go:60: stall detected at round 1, no forward progress for 30s

Don't think a single write to the DB has happened (the db is running happily and has a fresh accounts table). Going to add some stack traces and retry.
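For illustration, a rough sketch of one way to do that: dump all goroutine stacks from the test process when a round stops making progress. This is not the actual roachtest change, and the progress closure is a placeholder:

// Illustrative sketch only (not the actual roachtest diff): dump all goroutine
// stacks from the test binary when no forward progress is made within the
// stall deadline. progress is a hypothetical closure returning the number of
// completed transfers.
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

func dumpStacksIfStalled(progress func() int, timeout time.Duration) {
	before := progress()
	time.Sleep(timeout)
	if progress() == before {
		fmt.Fprintln(os.Stderr, "stall detected, dumping goroutine stacks:")
		// debug=2 prints stacks in the same panic-style format as the traces
		// pasted later in this thread.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
	}
}

func main() {
	transfers := 0 // stands in for the test's per-client transfer counter
	dumpStacksIfStalled(func() int { return transfers }, 30*time.Second)
}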

@tbg
Member

tbg commented Sep 27, 2018

Hmm, looking at a stuck run in round 18.

The cluster is happy, and the accounts table has only zeroes as well. Wouldn't be surprised if the test is doing something silly.

@tbg
Member

tbg commented Sep 27, 2018

I CTRL+C'ed a test and can't be sure whether this is why it happened, but I got stuck with this stack trace:

goroutine 59 [IO wait, 1 minutes]:
internal/poll.runtime_pollWait(0x7fc2683adc90, 0x77, 0xc42062eea0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42068d518, 0x77, 0xc42062ee00, 0xc42062eef0, 0xc42062ee01)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitWrite(0xc42068d518, 0xc42068d500, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:94 +0x3d
internal/poll.(*FD).WaitWrite(0xc42068d500, 0xc420042070, 0xc420042070)
	/usr/local/go/src/internal/poll/fd_unix.go:440 +0x37
net.(*netFD).connect(0xc42068d500, 0x1b95b80, 0xc420042070, 0x0, 0x0, 0x1b7f8c0, 0xc4202f7880, 0x0, 0x0, 0x0, ...)
	/usr/local/go/src/net/fd_unix.go:152 +0x299
net.(*netFD).dial(0xc42068d500, 0x1b95b80, 0xc420042070, 0x1b9d080, 0x0, 0x1b9d080, 0xc4206b8270, 0xc42062f0f0, 0x75d83e)
	/usr/local/go/src/net/sock_posix.go:142 +0xe9
net.socket(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x2, 0x1, 0x0, 0x0, 0x1b9d080, 0x0, ...)
	/usr/local/go/src/net/sock_posix.go:93 +0x1a6
net.internetSocket(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x1b9d080, 0x0, 0x1b9d080, 0xc4206b8270, 0x1, 0x0, ...)
	/usr/local/go/src/net/ipsock_posix.go:141 +0x129
net.doDialTCP(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x0, 0xc4206b8270, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/tcpsock_posix.go:62 +0xb9
net.dialTCP(0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0x0, 0xc4206b8270, 0x1d6c7b5dbbd31, 0x2008ed4, 0x2008ed401b95b80)
	/usr/local/go/src/net/tcpsock_posix.go:58 +0xe4
net.dialSingle(0x1b95b80, 0xc420042070, 0xc42068d480, 0x1b83ec0, 0xc4206b8270, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/dial.go:547 +0x375
net.dialSerial(0x1b95b80, 0xc420042070, 0xc42068d480, 0xc4202bc360, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/dial.go:515 +0x22d
net.(*Dialer).DialContext(0xc42062f630, 0x1b95b80, 0xc420042070, 0x19adb98, 0x3, 0xc4202e8940, 0xf, 0x0, 0x0, 0x0, ...)
	/usr/local/go/src/net/dial.go:397 +0x678
net.(*Dialer).Dial(0xc42062f630, 0x19adb98, 0x3, 0xc4202e8940, 0xf, 0x5ff7a7, 0x0, 0xc42062f660, 0x3)
	/usr/local/go/src/net/dial.go:320 +0x75
net.Dial(0x19adb98, 0x3, 0xc4202e8940, 0xf, 0xbeb668, 0x5c5920, 0x19bd16a, 0x19bd197)
	/usr/local/go/src/net/dial.go:291 +0x99
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.defaultDialer.Dial(0x19adb98, 0x3, 0xc4202e8940, 0xf, 0x182d8c0, 0xc4206247e0, 0xc420624838, 0x10)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:95 +0x49
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.dial(0x1b85040, 0x28808a8, 0xc4206b8150, 0x0, 0xc4206248a8, 0x1, 0x2)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:379 +0x35e
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.DialOpen(0x1b85040, 0x28808a8, 0xc4204ca1c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:338 +0x553
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.Open(0xc4204ca1c1, 0x2f, 0xc42062fa60, 0x5fc165, 0xc420827500, 0xc420646000)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:251 +0x4d
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*Driver).Open(0x28808a8, 0xc4204ca1c1, 0x2f, 0x9325bd, 0xc40000003a, 0xc4206b80c0, 0x86)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:45 +0x35
database/sql.dsnConnector.Connect(0xc4204ca1c1, 0x2f, 0x1b7e8a0, 0x28808a8, 0x1b95b80, 0xc420042070, 0x0, 0x0, 0xc420545f00, 0x7fc268400000)
	/usr/local/go/src/database/sql/sql.go:600 +0x45
database/sql.(*DB).conn(0xc4204ba8c0, 0x1b95b80, 0xc420042070, 0xc42062fd01, 0x94595c, 0xc4204bb540, 0xc4206246c0)
	/usr/local/go/src/database/sql/sql.go:1103 +0x131
database/sql.(*DB).exec(0xc4204ba8c0, 0x1b95b80, 0xc420042070, 0x1a1cf11, 0xb5, 0xc42062fee8, 0x3, 0x3, 0x4db34b9682630c01, 0xc42062fdb8, ...)
	/usr/local/go/src/database/sql/sql.go:1367 +0x66
database/sql.(*DB).ExecContext(0xc4204ba8c0, 0x1b95b80, 0xc420042070, 0x1a1cf11, 0xb5, 0xc42062fee8, 0x3, 0x3, 0x10, 0x1756f20, ...)
	/usr/local/go/src/database/sql/sql.go:1349 +0xe1
database/sql.(*DB).Exec(0xc4204ba8c0, 0x1a1cf11, 0xb5, 0xc42062fee8, 0x3, 0x3, 0xc42062fed0, 0x6b3676, 0xbee34825c2005275, 0xb6362116)
	/usr/local/go/src/database/sql/sql.go:1363 +0x85
main.(*bankClient).transferMoney(0xc420447950, 0x3e7, 0x3e7, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:63 +0x206
main.(*bankState).transferMoney(0xc42042cd70, 0x1b95b40, 0xc4204c2640, 0xc4205ee0c0, 0x1, 0x3e7, 0x3e7)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:152 +0xb6
created by main.runBankNodeRestart
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:390 +0x3af

@tbg
Member

tbg commented Sep 27, 2018

Hrm, this test is a bit nonsensical; look at this query:

UPDATE bank.accounts
   SET balance = CASE id WHEN $1 THEN balance-$3 WHEN $2 THEN balance+$3 END
 WHERE id IN ($1, $2) AND (SELECT balance >= $3 FROM bank.accounts WHERE id = $1)

This is a no-op if the withdrawal account $1 doesn't have a balance of at least $3, but how would it? We never initialize any account with a nonzero balance in this test.

@petermattis
Collaborator

this test is a bit nonsensical

Did I cut&paste something incorrectly from the old acceptance test? Doesn't look like it.

@tbg
Member

tbg commented Sep 27, 2018

Got the stall error too, on the first round. The cluster is healthy and happy to respond to queries.

=== RUN   acceptance/bank/node-restart
> /home/tschottdorf/go/bin/roachprod wipe local
local: stopping
local: wiping
> /home/tschottdorf/go/bin/roachprod put local /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/cockroach ./cockroach
local: putting /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/cockroach ./cockroach

   1: done
   2: done
   3: done
   4: done
> /home/tschottdorf/go/bin/roachprod start local
local: starting
local: initializing cluster settings
SET CLUSTER SETTING
monkey starts (seed -3775595060315039375)
round 1: restarting nodes [2]
round 1: restarting 2
> /home/tschottdorf/go/bin/roachprod stop local:2
local: stopping
> /home/tschottdorf/go/bin/roachprod start local:2
local: starting
round 1: monkey sleeping while cluster recovers...
round 1: client counts: (0)
goroutine 31 [running]:
runtime/debug.Stack(0x6b3676, 0xbee34ddc05743323, 0x771c1682b)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x22
main.(*bankState).waitClientsStop(0xc42042c0f0, 0x1b95e00, 0xc42002a200, 0xc42038fe40, 0xc4208d20c0, 0x6fc23ac00)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:299 +0x76e
main.runBankNodeRestart(0x1b95e00, 0xc42002a200, 0xc42038fe40, 0xc4208d20c0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:418 +0x662
main.registerAcceptance.func1(0x1b95e00, 0xc42002a200, 0xc42038fe40, 0xc4208d20c0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:60 +0x8e
main.(*registry).run.func2(0x1a758b0, 0xc42038fe40, 0xc420292780, 0xc4204e43e0, 0xc42000e098, 0x1b95e00, 0xc420661300, 0xc420823400)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:792 +0x458
created by main.(*registry).run
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test.go:638 +0x2ed
round 1: not waiting for recovery due to signal that we're done
> /home/tschottdorf/go/bin/roachprod get local logs artifacts/acceptance/bank/node-restart/logs
local: getting logs artifacts/acceptance/bank/node-restart/logs

   1: done
   2: done
   3: done
   4: done
--- FAIL: acceptance/bank/node-restart (32.18s)
	test.go:503,bank.go:300,bank.go:418,acceptance.go:60: stall detected at round 1, no forward progress for 30s

@tbg
Member

tbg commented Sep 27, 2018

Did I cut&paste something incorrectly from the old acceptance test? Doesn't look like it.

Might've been nonsensical for a long time. We just need to initialize the accounts with, say, 100 instead of a zero balance and it'll actually do something. But first let's deflake it...
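For illustration, a rough sketch of seeding the accounts with a nonzero balance. Table and column names follow the query quoted above; the connection string and helper names are placeholders, and the real bank.go initialization may differ:

// Rough sketch only: seed every account with a nonzero starting balance so the
// conditional UPDATE quoted above can actually move money.
package main

import (
	gosql "database/sql"

	_ "github.com/lib/pq" // Postgres-wire driver used to talk to CockroachDB
)

func seedAccounts(db *gosql.DB, numAccounts, initialBalance int) error {
	if _, err := db.Exec(`CREATE DATABASE IF NOT EXISTS bank`); err != nil {
		return err
	}
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS bank.accounts (id INT PRIMARY KEY, balance INT NOT NULL)`,
	); err != nil {
		return err
	}
	for id := 0; id < numAccounts; id++ {
		// UPSERT keeps re-runs idempotent; each account starts with initialBalance.
		if _, err := db.Exec(
			`UPSERT INTO bank.accounts (id, balance) VALUES ($1, $2)`,
			id, initialBalance,
		); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Placeholder connection string matching the insecure local cluster from the logs.
	db, err := gosql.Open("postgres", "postgresql://root@127.0.0.1:26257/?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if err := seedAccounts(db, 1000, 100); err != nil {
		panic(err)
	}
}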

@tbg
Member

tbg commented Sep 27, 2018

The "stall detected" message comes with this goroutine stack (the other goroutines are doing harmless stuff):

goroutine 47 [IO wait]:
internal/poll.runtime_pollWait(0x7f7105916bc0, 0x72, 0xc420233388)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc4205d4d18, 0x72, 0xffffffffffffff00, 0x1b81060, 0x2607d20)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc4205d4d18, 0xc420575000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc4205d4d00, 0xc420575000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc4205d4d00, 0xc420575000, 0x1000, 0x1000, 0xc42005e570, 0xc42005e500, 0xc4208ca098)
	/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc42033e008, 0xc420575000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6a
bufio.(*Reader).Read(0xc4202c2180, 0xc4201302e0, 0x5, 0x200, 0x50, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:216 +0x238
io.ReadAtLeast(0x1b7c2c0, 0xc4202c2180, 0xc4201302e0, 0x5, 0x200, 0x5, 0x8, 0x18, 0xc4202943e0)
	/usr/local/go/src/io/io.go:309 +0x86
io.ReadFull(0x1b7c2c0, 0xc4202c2180, 0xc4201302e0, 0x5, 0x200, 0x1ff, 0x50, 0xc420233620)
	/usr/local/go/src/io/io.go:327 +0x58
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recvMessage(0xc4201302c0, 0xc4202943e0, 0x1ff, 0x50, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:947 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).recv(0xc4201302c0, 0xc420233768, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:977 +0x7c
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*conn).startup(0xc4201302c0, 0xc4204c25d0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1099 +0x962
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.DialOpen(0x1b85400, 0x28808a8, 0xc42043c3c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:344 +0x646
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.Open(0xc42043c3c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:251 +0x4d
github.com/cockroachdb/cockroach/vendor/github.com/lib/pq.(*Driver).Open(0x28808a8, 0xc42043c3c1, 0x2f, 0x0, 0x0, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:45 +0x35
database/sql.dsnConnector.Connect(0xc42043c3c1, 0x2f, 0x1b7ec60, 0x28808a8, 0x1b95f40, 0xc420042070, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:600 +0x45
database/sql.(*DB).conn(0xc4203d8820, 0x1b95f40, 0xc420042070, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/database/sql/sql.go:1103 +0x131
database/sql.(*DB).exec(0xc4203d8820, 0x1b95f40, 0xc420042070, 0x1a1d299, 0xb5, 0xc420233ee8, 0x3, 0x3, 0x7f7105974f01, 0xc42047d505, ...)
	/usr/local/go/src/database/sql/sql.go:1367 +0x66
database/sql.(*DB).ExecContext(0xc4203d8820, 0x1b95f40, 0xc420042070, 0x1a1d299, 0xb5, 0xc420233ee8, 0x3, 0x3, 0x10, 0x17572a0, ...)
	/usr/local/go/src/database/sql/sql.go:1349 +0xe1
database/sql.(*DB).Exec(0xc4203d8820, 0x1a1d299, 0xb5, 0xc42047d6e8, 0x3, 0x3, 0x3a98cde8, 0xc42047d6d0, 0x6b3676, 0xbee34edc50d05a97)
	/usr/local/go/src/database/sql/sql.go:1363 +0x85
main.(*bankClient).transferMoney(0xc4204c3590, 0x3e7, 0x3e7, 0x0, 0x0)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:65 +0x212
main.(*bankState).transferMoney(0xc4204c0cd0, 0x1b95f00, 0xc4204be580, 0xc4204da0c0, 0x1, 0x3e7, 0x3e7)
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:159 +0xba
created by main.runBankNodeRestart
	/home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/bank.go:410 +0x48a

The server is happy. I'm afraid this means that what I want is a stack trace of the node the client is talking to. Let's see.
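For illustration, a rough sketch of grabbing that server-side stack trace. It assumes the node's HTTP port serves the standard Go net/http/pprof endpoints under /debug/pprof/ (CockroachDB's debug pages expose these); 26258 is taken from the --http-port flag in the logs above:

// Illustrative sketch: pull a full goroutine dump from the node the client is
// talking to, via the standard net/http/pprof goroutine profile.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func dumpServerStacks(httpAddr string) error {
	// debug=2 returns panic-style stacks, matching the traces in this thread.
	resp, err := http.Get(fmt.Sprintf("http://%s/debug/pprof/goroutine?debug=2", httpAddr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(os.Stdout, resp.Body)
	return err
}

func main() {
	if err := dumpServerStacks("127.0.0.1:26258"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}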

@tbg
Member

tbg commented Sep 27, 2018

Hmm. I had what I thought was a failing test run:

test finished, waiting for shutdown of 4 clients
round 26: not waiting for recovery due to signal that we're done
[   2] acceptance/bank/node-restart: ??? (0s)
[   3] acceptance/bank/node-restart: ??? (0s)
[   4] acceptance/bank/node-restart: ??? (0s)
[   5] acceptance/bank/node-restart: ??? (0s)
client 1 shutting down
11666 transfers (36.4/sec) in 320.4s

The client 1 shutting down line coincided exactly with my running select * from bank.accounts where balance = 0; against the cluster. Maybe a coincidence. I'll do the same next time.

@tbg
Member

tbg commented Sep 27, 2018

Yeah - the same thing happened again. client 1 shutting down was printed precisely as I queried the table.

round 14: monkey sleeping while cluster recovers...
round 14: client counts: (8977)
[   1] acceptance/bank/node-restart: ??? (0s)
round 14: not waiting for recovery due to signal that we're done
[   2] acceptance/bank/node-restart: ??? (0s)
client 1 shutting down
8978 transfers (55.3/sec) in 162.3s
> /home/tschottdorf/go/bin/roachprod get local logs artifacts/acceptance/bank/node-restart/logs
local: getting logs artifacts/acceptance/bank/node-restart/logs

   1: done
   2: done
   3: done
   4: done
--- PASS: acceptance/bank/node-restart (164.61s)

@tbg
Member

tbg commented Sep 27, 2018

Next hang. This time, the cluster is not fine (connecting via the SQL shell times out).

[   1] acceptance/bank/node-restart: ??? (0s)
test finished, waiting for shutdown of 4 clients
round 25: not waiting for recovery due to signal that we're done
[   2] acceptance/bank/node-restart: ??? (0s)
[   3] acceptance/bank/node-restart: ??? (0s)
[   4] acceptance/bank/node-restart: ??? (0s)
[   5] acceptance/bank/node-restart: ??? (0s)
[   6] acceptance/bank/node-restart: ??? (0s)
[   7] acceptance/bank/node-restart: ??? (0s)
[   8] acceptance/bank/node-restart: ??? (0s)
[   9] acceptance/bank/node-restart: ??? (0s)

Okay, here's what node 4 (the one the client connects to) says about this connection:

goroutine 547 [IO wait]:
internal/poll.runtime_pollWait(0x7f022e731bc0, 0x72, 0xc429fd1730)
        /usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc42061c518, 0x72, 0xffffffffffffff00, 0x2f2a900, 0x4142248)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9b
internal/poll.(*pollDesc).waitRead(0xc42061c518, 0xc42526c000, 0x1000, 0x1000)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc42061c500, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:157 +0x17d
net.(*netFD).Read(0xc42061c500, 0xc42526c000, 0x1000, 0x1000, 0x0, 0xc429fd1868, 0x82419c)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc420842038, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:176 +0x6a
github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux.(*MuxConn).Read(0xc4208245a0, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/cmux/cmux.go:218 +0x119
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*readTimeoutConn).Read(0xc420389280, 0xc42526c000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:1372 +0x110
bufio.(*Reader).fill(0xc425232060)
        /usr/local/go/src/bufio/bufio.go:100 +0x11e
bufio.(*Reader).ReadByte(0xc425232060, 0xbee351ab872a5d51, 0x4c92b6824, 0x46c8ca0)
        /usr/local/go/src/bufio/bufio.go:242 +0x39
github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase.(*ReadBuffer).ReadTypedMsg(0xc4252321d0, 0x2f3ab20, 0xc425232060, 0x2f34220, 0x4717650, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/pgwirebase/encoding.go:120 +0x35
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl(0xc425232000, 0x2f51de0, 0xc420a16400, 0xc4202bb6b0, 0xc420890100, 0x5400, 0x15000, 0xc4205d9c70, 0xc420b38090, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:285 +0x5bd
github.com/cockroachdb/cockroach/pkg/sql/pgwire.serveConn(0x2f51de0, 0xc420a16380, 0x2f6cb00, 0xc4208245a0, 0x0, 0x0, 0xc42522e01e, 0x4, 0x0, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:163 +0x207
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*Server).ServeConn(0xc4205d9b00, 0x2f51de0, 0xc420a16380, 0x2f6cb00, 0xc4208245a0, 0x0, 0x0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/server.go:453 +0x711
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start.func19.1(0x2f6cb00, 0xc4208245a0)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1631 +0x180
github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith.func1(0xc420b38090, 0x2f51ea0, 0xc42070c3f0, 0xc420304420, 0x2f6cb00, 0xc4208245a0, 0xc42070c420)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:141 +0xa9
created by github.com/cockroachdb/cockroach/pkg/util/netutil.(*Server).ServeWith
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/netutil/net.go:138 +0x249

@tbg
Member

tbg commented Sep 27, 2018

Interesting. I can actually connect to the SQL shell and query the table on n4, but not on n1 and n2. Then I tried n1 again and it worked. n2 still doesn't. Hmm. I wonder if this has something to do with it:

$ lsof | grep tschottdorf | wc -l
11562

@tbg
Member

tbg commented Sep 27, 2018

Restricting to network file descriptors, there are only a few dozen, which seems reasonable, so back to the nodes themselves... n1 is clearly unhappy:

goroutine 26 [select, 19 minutes]:
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).lookupRangeDescriptorInternal(0xc420824550, 0x2f51ea0, 0xc420a90180, 0xc420309c50, 0x10, 0x10, 0x0, 0x0, 0x0, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:356 +0x4f7
github.com/cockroachdb/cockroach/pkg/kv.(*RangeDescriptorCache).LookupRangeDescriptor(0xc420824550, 0x2f51ea0, 0xc420a90180, 0xc420309c50, 0x10, 0x10, 0x0, 0x3fc3333333333300, 0xc420b36540, 0x0, ...)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/kv/range_cache.go:241 +0x92
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).getDescriptor(0xc420264600, 0x2f51ea0, 0xc420a90180, 0xc420309c50, 0x10, 0x10, 0x0, 0xc420b36500, 0x2faf080, 0x3b9aca00, ...)

@nvanbenschoten
Member

Do you have evidence that we're sending super-large batches on the Raft transport?

No, and that theory didn't pan out. I still see all followers slowly fall behind the leader under heavy load.

@nvanbenschoten
Member

Everything I've written about so far is concerning and deserves continued investigation. It is pretty easy to reproduce with #31081 if you disable nobarrier and use Jaeger to watch for long requests. However, I think I've convinced myself that it's not the issue behind the stalls we see in this test, because we don't see any of its symptoms in the log. I've opened #31330 to continue tracking it.

To recap, we've fixed three separate stalls so far, and the test passes reliably for about 2-3 hours of stress. When it does fail, it doesn't stall indefinitely as it did with the other issues; instead, it stalls for only about 60 seconds. I suspect that the following log line explains these 60-second stalls. We see these messages on n4, which is serving as the SQL gateway in the test, whenever it fails.

E181012 22:26:54.548553 99 sqlmigrations/migrations.go:370  [n4] failed attempt to acquire migration lease: lease /System/"system-version/lease" is not available until at least 1539383274.491341436,0

The messages continue for exactly a minute before going away. They originate from EnsureMigrations, and they indicate that the server is not yet accepting SQL connections. I suspect that the issue here is that we kill a node which has acquired but not yet released the SQL migration lease. To back this up, I see that n2 was killed immediately after logging

I181012 22:26:54.413824 90 cli/start.go:507  initial startup completed, will now wait for `cockroach init`

We then establish a SQL connection to a node that is stuck trying to acquire this lease and has to wait for the abandoned lease to expire. I don't have a good solution here in the general case, but there are a few things we could do in the test to fix this and prevent the flake:

  1. we could round-robin the SQL connection instead of only using a single node that could get stuck
  2. we could ensure that all nodes can serve SQL connections before beginning the chaos monkey
  3. we could adjust roachprod to wait until server initialization is complete before returning from roachprod start

I think the second option is the way to go, but I want to make sure it's not undermining the purpose of the test. @petermattis could you weigh in on that? Are we trying to test everything from the server startup process onwards while under chaos?

Also for what it's worth, I only noticed this because I searched for ^E in the logs. I don't think I would have caught it if my branch had included #31270. Depending on the outcome of this investigation, we may want to revert that.

@petermattis
Collaborator

Interesting. Is it really an inability to get the migrations lease that is causing these 60s stalls? If we disable that code, do the stalls go away? Why are we even trying to get the migrations lease? It looks to me like EnsureMigrations only tries to get the lease if there are migrations to run. So what migration hasn't been run, given that the cluster was just created and has only ever used a single version of cockroach?

Of the mitigation options, the second sounds the best.

@tbg
Member

tbg commented Oct 13, 2018

I'm also confused why this is happening. Are there really any outstanding migrations? If so, do you see this only just after the cluster got booted up? Something seems fishy here.

@nvanbenschoten
Member

If we disable that code do the stalls go away?

I'm testing that right now. I suspect that they will unless there's another class of stalls.

Why are we even trying to get the migrations lease? Looks to me like EnsureMigrations only tries to get the lease if there are migrations to run. So what migration hasn't been run given that the cluster is created and only ever using a single version of cockroach?

It doesn't look like there is any protection against a fresh cluster running these SQL migrations, and I can certainly see all backwardCompatibleMigrations with a workFn run in a fresh cluster if I drop some logging into that area of the code. The cluster will search for migration keys, find none, and run all available migrations. I suppose they should all be quick because they are mostly no-ops, and it's important that they create migration keys as they complete. @benesch do you know whether this was intentional, an oversight, or somewhere in between?

@benesch
Contributor

benesch commented Oct 14, 2018

It doesn't look like there is any protection against a fresh cluster running these SQL migrations, and I can certainly see all backwardCompatibleMigrations with a workFn run in a fresh cluster if I drop some logging into that area of the code. The cluster will search for migration keys, find none, and run all available migrations. I suppose they should all be quick because they are mostly no-ops, and it's important that they create migration keys as they complete. @benesch do you know whether this was intentional, an oversight, or somewhere in between?

It's intentional. I've never liked it much, but we use migrations as a convenient place to run cluster bootstrapping code (i.e., code that must be run at least once when a cluster is first created). For example:

{
	// Introduced in v1.1. Permanent migration.
	name:   "populate initial version cluster setting table entry",
	workFn: populateVersionSetting,
},

(which I think Tobi added actually 😀)

@benesch
Contributor

benesch commented Oct 14, 2018

Also, even without these "permanent" migrations, fresh clusters still tend to have migrations:

// Adding a new system table? Don't add it to the metadata schema yet!
//
// The first release to contain the system table must add the system table
// via a migration in both fresh clusters and existing clusters. Only in the
// next release should you "bake in" the migration by adding the table to
// the metadata schema here. This ensures there's only ever one code path
// responsible for creating the table.

@tbg
Member

tbg commented Oct 15, 2018

Somehow I thought you said that a lease was acquired every time a node started, but I just verified (and it seems you're all implying this) that it only happens once at the beginning of a new cluster (assuming all the migrations get applied before something gets killed). I think it's fine if we just work around that then. Should be enough to SELECT 1 in the monkey before going for the kill, right?

@nvanbenschoten
Member

Should be enough to SELECT 1 in the monkey before going for the kill, right?

Yeah exactly, and in doing so I've only had a single stall (which was persistent) in the last 36 hours of stressing the test. I'll make that change, fix the proposal forwarding issues now that etcd-io/etcd#10167 is merged, and rebase the test on master to continue the hunt. It's possible we've already fixed this stall, and if not, it appears to be exceedingly rare.

@petermattis
Collaborator

Should be enough to SELECT 1 in the monkey before going for the kill, right?

I think it would be sufficient to verify all nodes can serve a SELECT 1 before starting up the test proper. This doesn't have to be done before every node kill, though I suppose that doesn't hurt.

@nvanbenschoten
Member

I was doing it at the beginning of bankState.chaosMonkey, which has the effect you're describing.
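For illustration, a rough sketch of that readiness check; the function and variable names are placeholders rather than the actual bank.go code:

// Placeholder sketch of the fix being discussed: before the chaos monkey kills
// anything, make sure every node answers a trivial query, so the test never
// hangs on a node that is still blocked on startup work such as an abandoned
// SQL migration lease.
package bank

import (
	"context"
	gosql "database/sql"
	"fmt"
	"time"
)

func waitForSQLReady(ctx context.Context, dbs []*gosql.DB, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for i, db := range dbs {
		for {
			if _, err := db.ExecContext(ctx, `SELECT 1`); err == nil {
				break // node i+1 is serving SQL; move on to the next one
			} else if time.Now().After(deadline) {
				return fmt.Errorf("node %d still not serving SQL after %s: %v", i+1, timeout, err)
			}
			time.Sleep(time.Second)
		}
	}
	return nil
}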

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 15, 2018
Informs cockroachdb#30064.

This prevents us from running into issues with SQL migrations.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Oct 15, 2018
Informs cockroachdb#30064.

This prevents us from running into issues with SQL migrations.

Release note: None
craig bot pushed a commit that referenced this issue Oct 15, 2018
30989: roachtest/bank: don't start chaos until nodes are serving SQL r=nvanbenschoten a=nvanbenschoten

Informs #30064.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this issue Oct 17, 2018
31408: storage: replace remote proposal tracking with uncommitted log size protection r=nvanbenschoten a=nvanbenschoten

Closes #30064.

This change reverts most of the non-testing code from 03b116f and f2f3fd2
and replaces it with use of the MaxUncommittedEntriesSize config. This
configuration was added in etcd-io/etcd#10167 and provides protection against
unbounded Raft log growth when a Raft group stops being able to commit
entries. It makes proposals into Raft safer because proposers don't need
to verify before the fact that the proposal isn't a duplicate that might
be blowing up the size of the Raft group.

By default, the configuration is set to double the Replica's proposal quota.
The logic here is that the quotaPool should be responsible for throttling
proposals in all cases except for unbounded Raft re-proposals because it
queues efficiently instead of dropping proposals on the floor indiscriminately.

Release note (bug fix): Fix a bug where Raft proposals could get
stuck if forwarded to a leader who could not itself append a new
entry to its log.

This will be backported, but not to 2.1.0. The plan is to get it into 2.1.1.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
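For context, a rough sketch of the upstream etcd/raft knob this relies on; it illustrates the raft.Config field added in etcd-io/etcd#10167, not CockroachDB's actual wiring, and the import path shown is the modern one:

// Sketch only: where MaxUncommittedEntriesSize plugs into an etcd/raft config.
// quotaBytes stands in for the Replica's proposal quota; doubling it mirrors
// the default described in the commit message above.
package raftcfg

import "go.etcd.io/etcd/raft"

func newRaftConfig(id uint64, storage raft.Storage, quotaBytes uint64) *raft.Config {
	return &raft.Config{
		ID:              id,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   16 << 10,
		MaxInflightMsgs: 64,
		// Cap the total byte size of uncommitted entries a leader will accept;
		// once exceeded, new proposals are rejected with ErrProposalDropped
		// instead of growing the Raft log without bound. Zero means no limit.
		MaxUncommittedEntriesSize: 2 * quotaBytes,
	}
}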
@craig craig bot closed this as completed in #31408 Oct 17, 2018
@nvanbenschoten
Member

Yeah exactly, and in doing so I've only had a single stall (which was persistent) in the last 36 hours of stressing the test.

I'm not confident I actually had all of the fix from #31408 in place when I hit this stall, and after ensuring that, I haven't been able to trigger a stall again in more than 24 hours. I'm closing this for now but will keep stressing it for the next few days. We'll see if it ever comes up again.

@nvanbenschoten
Member

4 days of continuous stress without a failure. I'm declaring this stable.

@petermattis
Collaborator

4 days of continuous stress without a failure. I'm declaring this stable.

🎉 🎉 🎉

tbg added a commit to tbg/cockroach that referenced this issue Jan 10, 2019
This was forgotten, cockroachdb#30064 is long fixed :fingers_crossed:

Release note: None
craig bot pushed a commit that referenced this issue Jan 11, 2019
33549: acceptance: re-enable bank/zerosum-restart r=petermattis a=tbg

This was forgotten, #30064 is long fixed :fingers_crossed:

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>