Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: backupTPCC failed #84240

Closed
cockroach-teamcity opened this issue Jul 12, 2022 · 8 comments · Fixed by #84666
Closed

roachtest: backupTPCC failed #84240

cockroach-teamcity opened this issue Jul 12, 2022 · 8 comments · Fixed by #84666
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 12, 2022

roachtest.backupTPCC failed with artifacts on master @ 571bfa3afb3858ae84d8a8fcdbb0a38e058402a5:

test artifacts and logs in: /artifacts/backupTPCC/run_1
	monitor.go:127,backup.go:1095,test_runner.go:896: monitor failure: monitor task failed: dial tcp 34.139.144.3:26257: connect: connection refused
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerBackup.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:1095
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor task failed
		Wraps: (5) dial tcp 34.139.144.3:26257
		Wraps: (6) connect
		Wraps: (7) connection refused
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *net.OpError (6) *os.SyscallError (7) syscall.Errno

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

Jira issue: CRDB-17540

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 12, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 12, 2022
@msbutler
Copy link
Collaborator

the test failed mid restore because the cluster lost connection to node 1, when node 1's vm killed the local cockroach process. The cockroach logs do not indicate why though.

I don't think an OOM occurred given the latest line in its health log (seems like there's surplus memory; no heap profs):

I220712 06:09:47.982405 295 2@server/status/runtime_log.go:47 ⋮ [n1] 143  runtime stats: 1.3 GiB RSS, 429 goroutines (stacks: 5.1 MiB), 106 MiB/243 MiB Go alloc/total(stale) (heap fragmentation: 44 MiB, heap reserv    ed: 40 MiB, heap released: 774 MiB), 862 MiB/1.0 GiB CGO alloc/total (1025.4 CGO/sec), 101.8/25.2 %(u/s)time, 0.0 %gc (192x), 7.7 MiB/19 MiB (r/w)net`

1.journactl.txt, indicates the vm killed the cockroach binary:

3808 Jul 12 06:09:58 teamcity-5718837-1657603045-24-n3cpu4-0001 bash[14579]: ./cockroach.sh: line 76: 14587 Aborted                 (core dumped) "${BINARY}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG     _DIR}/cockroach.stderr.log"
3809 Jul 12 06:09:58 teamcity-5718837-1657603045-24-n3cpu4-0001 bash[15773]: cockroach exited with code 134: Tue Jul 12 06:09:58 UTC 2022
3810 Jul 12 06:09:58 teamcity-5718837-1657603045-24-n3cpu4-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=134/n/a
3811 Jul 12 06:09:58 teamcity-5718837-1657603045-24-n3cpu4-0001 systemd[1]: cockroach.service: Failed with result 'exit-code'.

One hint in the logs indicates that node 1 was not able to scatter a request to other nodes before the process got killed, which could suggest that the vm killed the binary due to networking issues? Maybe?

1237 W220712 06:09:48.334932 44903 ccl/backupccl/split_and_scatter_processor.go:380 ⋮ [n1,job=0,f‹a1497b1f›] 1146  scatter returned node 0. Route span starting at ‹/Table/106/1› to current node 1

@msbutler
Copy link
Collaborator

msbutler commented Jul 12, 2022

aha! found a panic in cockroach-stderr.log that relates to my recent PR 🕺

This panic occurs while restore span restoreSpanEntry is being iterated through in processRestoreSpanEntry:

  • A valid() call returned false and an error, causing the function to return early without flushing the the latest batch of keys.
  • Then, when the deferred cleanup occurs, the PebbleIterator.Close() panics because the underlying iterator returned an error when it was closed by pebbleIterator.destroy()

The panic might be swallowing the the error returned, though it's curious that EOF is added to the panic message. I wonder if that means that Valid and Close returned an error merely because at least one of the SSTs were iterated through. To test, I plan to add some custom logging and rerun this roachtest tomorrow.

panic: EOF                                                                      
    9                                                                                 
   10 goroutine 45749 [running]:                                                      
   11 panic({0x483a900, 0xc0000a4120})                                                
   12         GOROOT/src/runtime/panic.go:1147 +0x3a8 fp=0xc0061cb528 sp=0xc0061cb468 pc=0x48a828
   13 github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).destroy(0xc0039f0000)
   14         github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:897 +0x1e5 fp=0xc0061cb668 sp=0xc0061cb528 pc=0x17a61a5
   15 github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).Close(0xc0039f0000)
   16         github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:280 +0x7d fp=0xc0061cb690 sp=0xc0061cb668 pc=0x17a213d
   17 github.com/cockroachdb/cockroach/pkg/storage.(*ReadAsOfIterator).Close(...)     
   18         github.com/cockroachdb/cockroach/pkg/storage/read_as_of_iterator.go:34  
   19 github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).openSSTs.func2.1()
   20         github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:301 +0x5c fp=0xc0061cb730 sp=0xc0061cb690 pc=0x3961e3c
   21 github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).processRestoreSpanEntry(0xc007034a00, {0x6755af0, 0xc005533500}, 0x4891640, {{{{0xc0054d1540, 0x2, 0x8}, {0xc0054d1548, 0x2, 0x8}}, ...},       ...})
   22         github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:455 +0x13c9 fp=0xc0061cbc40 sp=0xc0061cb730 pc=0x3963969
   23 github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).runRestoreWorkers.func1.1(0xc00008d700, 0xc007034a00, {0x6755af0, 0xc005533500}, 0xc005e63400)
   24         github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:372 +0xfe fp=0xc0061cbeb8 sp=0xc0061cbc40 pc=0x396233e
   25 github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).runRestoreWorkers.func1({0x6755af0, 0xc005533500}, 0xf0f0000032fccac)
   26         github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:384 +0xce fp=0xc0061cbf30 sp=0xc0061cbeb8 pc=0x39621ee
   27 github.com/cockroachdb/cockroach/pkg/util/ctxgroup.GroupWorkers.func1({0x6755af0, 0xc005533500})
   28         github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:178 +0x2b fp=0xc0061cbf58 sp=0xc0061cbf30 pc=0x11daa6b
   29 github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()          
   30         github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169 +0x25 fp=0xc0061cbf78 sp=0xc0061cbf58 pc=0x11da905
   31 golang.org/x/sync/errgroup.(*Group).Go.func1()                                  
   32         golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74 +0x64 fp=0xc0061cbfe0 sp=0xc0061cbf78 pc=0x11da0a4
   33 runtime.goexit()                                                                
   34         GOROOT/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0061cbfe8 sp=0xc0061cbfe0 pc=0x4bf321
   35 created by golang.org/x/sync/errgroup.(*Group).Go                               
   36         golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:71 +0xa9

@cockroach-teamcity
Copy link
Member Author

roachtest.backupTPCC failed with artifacts on master @ 687171ac6c2cd9992486bb3b8c9d252ac95ca1cd:

test artifacts and logs in: /artifacts/backupTPCC/run_1
	monitor.go:127,backup.go:1095,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 134)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerBackup.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:1095
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 2: dead (exit status 134)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.backupTPCC failed with artifacts on master @ 88d3253301457ac57820e0f4a4fab8f74bf9f38b:

test artifacts and logs in: /artifacts/backupTPCC/run_1
	monitor.go:127,backup.go:1095,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 134)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerBackup.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:1095
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 3: dead (exit status 134)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.backupTPCC failed with artifacts on master @ e9ee21860458d997a8155734dc608cfcd050ef24:

test artifacts and logs in: /artifacts/backupTPCC/run_1
	monitor.go:127,backup.go:1095,test_runner.go:896: monitor failure: monitor task failed: pq: pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: exhausted retries: importing 12 ranges: EOF
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerBackup.func4
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/backup.go:1095
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (4) monitor task failed
		Wraps: (5) pq: pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: exhausted retries: importing 12 ranges: EOF
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *pq.Error

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@msbutler
Copy link
Collaborator

@jbowens @erikgrinaker The restore job failed because the pebbleIterator.Valid() returned false, EOF on some span. Why should Valid() return a non nil error if we've reached the end of a file? At the end of a file, I would expect Valid() to return false, <nil>, right?

I see the following in the go SDK, which suggest that EOF implies a graceful end to reading a file:

// Functions should return EOF only to signal a graceful end of input.
// If the EOF occurs unexpectedly in a structured data stream,
// the appropriate error is either ErrUnexpectedEOF or some other error
// giving more detail.
var EOF = errors.New("EOF")

@erikgrinaker
Copy link
Contributor

@jbowens Seems like we should interpret EOF like an exhausted iterator. Probably better to handle this at the Pebble layer, since it handles lower-level IO?

@jbowens
Copy link
Collaborator

jbowens commented Jul 18, 2022

Oops, missed the mention until now. An io.EOF while reading an sstable should only be surfaced when reading at the end of the file (eg, reading the footer table.go:312). All the block reads should be through ReadAt calls at block offsets. I don't think it's legal to return io.EOF from ReadAt unless actually reading at the tail of the file and it's exhausted: https://pkg.go.dev/io#ReaderAt

If EOF is encountered in a block in the middle of the file, it looks like corruption.

I think we need to track down where the io.EOF is originating and what read call is surfacing it.

msbutler added a commit to msbutler/cockroach that referenced this issue Jul 19, 2022
This PR refactors all call sites of ExternalSSTReader(), to support using the
new PebbleIterator, which has baked in range key support. Most notably, this
PR replaces the multiIterator used in the restore data processor with the
PebbleSSTIterator.

This patch is apart of a larger effort to teach backup and restore about MVCC
bulk operations. Next, the readAsOfIterator will need to learn how to
deal with range keys.

Informs cockroachdb#71155

This PR addresses a bug created in cockroachdb#83984: loop variables in
ExternalSSTReader were captured by reference, leading to roachtest failures
(cockroachdb#84240, cockroachdb#84162).

Informs #71155i

Fixes: cockroachdb#84240, cockroachdb#84162, cockroachdb#84181

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Jul 19, 2022
This PR refactors all call sites of ExternalSSTReader(), to support using the
new PebbleIterator, which has baked in range key support. Most notably, this
PR replaces the multiIterator used in the restore data processor with the
PebbleSSTIterator.

This patch is apart of a larger effort to teach backup and restore about MVCC
bulk operations. Next, the readAsOfIterator will need to learn how to
deal with range keys.

Informs cockroachdb#71155

This PR addresses a bug created in cockroachdb#83984: loop variables in
ExternalSSTReader were captured by reference, leading to roachtest failures
(cockroachdb#84240, cockroachdb#84162).

Informs #71155i

Fixes: cockroachdb#84240, cockroachdb#84162, cockroachdb#84181

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Jul 19, 2022
This PR refactors all call sites of ExternalSSTReader(), to support using the
new PebbleIterator, which has baked in range key support. Most notably, this
PR replaces the multiIterator used in the restore data processor with the
PebbleSSTIterator.

This patch is apart of a larger effort to teach backup and restore about MVCC
bulk operations. Next, the readAsOfIterator will need to learn how to
deal with range keys.

Informs cockroachdb#71155

This PR addresses a bug created in cockroachdb#83984: loop variables in
ExternalSSTReader were captured by reference, leading to roachtest failures
(cockroachdb#84240, cockroachdb#84162).

Informs #71155i

Fixes: cockroachdb#84240, cockroachdb#84162, cockroachdb#84181

Release note: none
craig bot pushed a commit that referenced this issue Jul 21, 2022
84452: sql: add troubleshooting mode session variable r=THardy98 a=THardy98

Resolves: #84429

This change introduces a `troubleshooting_mode_enabled` session
variable. When enabled, this session variable is intended to be used as
a way to avoid performing additional work on queries, particularly when
the cluster is experiencing issues/unavailability/failure. By default,
this session variable is disabled.  Currently, this session variable is
only used to avoid collecting/emitting telemetry data.

Release note (sql change): Introduce new `troubleshooting_mode_enabled`
session variable, to avoid doing additional work on queries when
possible (i.e. collection telemetry data). By default, this session
variable is disabled.

84666: storageccl: use the new PebbleIterator in ExternalSSTReader r=erikgrinaker a=msbutler

This PR refactors all call sites of ExternalSSTReader(), to support using the
new PebbleIterator, which has baked in range key support. Most notably, this
PR replaces the multiIterator used in the restore data processor with the
PebbleSSTIterator.

This patch is apart of a larger effort to teach backup and restore about MVCC
bulk operations. Next, the readAsOfIterator will need to learn how to
deal with range keys.

Informs #71155

This PR addresses a bug created in #83984: loop variables in
ExternalSSTReader were captured by reference, leading to roachtest failures
(#84240, #84162).

Informs #71155

Fixes: #84240, #84162, #84181

Release note: none

Co-authored-by: Thomas Hardy <thardy@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
@craig craig bot closed this as completed in 85cd5ab Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants