roachtest: disk-stalled/fuse/log=false,data=false failed #99215

Closed
cockroach-teamcity opened this issue Mar 22, 2023 · 5 comments
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 22, 2023

roachtest.disk-stalled/fuse/log=false,data=false failed with artifacts on release-23.1 @ 80c4895c566a7eaa6f16c4098980509dd3795ad7:

test artifacts and logs in: /artifacts/disk-stalled/fuse/log=false_data=false/run_1
(disk_stall.go:198).runDiskStalledDetection: connection to n1 is dead: driver: bad connection
(cluster.go:1969).Run: cluster.RunE: context canceled
(cluster.go:1969).Run: cluster.RunE: context canceled
(cluster.go:1969).Run: output in run_123522.366655401_n4_cockroach-workload-r: ./cockroach workload run kv --read-percent 50 --duration 10m --concurrency 256 --max-rate 2048 --tolerate-errors  --min-block-bytes=512 --max-block-bytes=512 {pgurl:1-3} returned: context canceled

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-25764

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 22, 2023
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 22, 2023
@jbowens
Collaborator

jbowens commented Mar 22, 2023

No disk stall should've been induced on n1 because this is the log=false, data=false variant. However, n1 did experience a disk stall, at about the same time the test would've induced one had it been another variant. This seems like a bug in the test.

F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958  file write stall detected: disk slowness detected: syncdata on file ‹/mnt/data1/cockroach/real/000011.log› has been ongoing for 11.5s
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !goroutine 48379 [running]:
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x1)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0x89
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0xc000c2d720, {{{0xc00027e0c0, 0x24}, {0x0, 0x0}, {0x5a64623, 0x1}, {0x5a64623, 0x1}}, 0x174ebdf00001a961, ...})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/pkg/util/log/clog.go:262 +0x97
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepthInternal({0x6f0d2a0, 0xc000b87140}, 0x2, 0x4, 0x0, 0x0?, {0x5a2ef11, 0x1d}, {0xc0085ade18, 0x1, ...})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:106 +0x645
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/util/log.logfDepth(...)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/pkg/util/log/channels.go:39
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/util/log/log_channels_generated.go:848
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).makeMetricEtcEventListener.func3({{0xc0066bfe00?, 0x2?}, 0x0?, 0x4be545?})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1232 +0x226
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble.TeeEventListener.func4({{0xc0066bfe00?, 0xc0048550c8?}, 0x80?, 0xc0014c7f00?})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/event.go:689 +0x43
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/cockroach/pkg/storage.wrapFilesystemMiddleware.func1({0xc0066bfe00?, 0x4f51d6?}, 0x0?, 0x4f59f7?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:668 +0x2a
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFS).Create.func2(0x3?, 0x1723bbbb80?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:547 +0x34
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker.func1()
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:163 +0x1b5
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !created by github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).startTicker
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:143 +0x5d

This inadvertent disk stall is also similar to #98202.

@jbowens jbowens added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue labels Mar 22, 2023
@jbowens
Collaborator

jbowens commented Mar 22, 2023

I'm suspicious that there's a bug somewhere in the vfs.FS stack.

The disk stall was detected during a syncdata of a WAL file: syncdata on file ‹/mnt/data1/cockroach/real/000011.log›. The stack trace of the call to Fatal is the stack of the file's disk-health monitoring goroutine. Since the SyncData call is stalled, we should see a separate goroutine stack for the SyncData in the goroutine dump.

We do see one goroutine waiting on a Fdatasync, but the stack clearly shows it's for an sstable, not a WAL file.

F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !goroutine 48694 [syscall]:
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !syscall.Syscall(0x43189561?, 0xc0050d5e70?, 0xc0050d5f08?, 0x11ca24a?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	GOROOT/src/syscall/syscall_linux.go:68 +0x27
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !golang.org/x/sys/unix.Fdatasync(0x14702fe5ba?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	golang.org/x/sys/unix/external/org_golang_x_sys/unix/zsyscall_linux.go:723 +0x30
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*linuxFile).SyncData(0x14702fe5ba?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/default_linux_noarm.go:65 +0x1d
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData.func1()
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:232 +0x2e
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc00bae5450, 0x3, 0xc0050d5fa8)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:280 +0xde
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData(0x1000?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:231 +0x53
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*enospcFile).SyncData(0xc0047e7db8?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:413 +0x22
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync(0xc0099fbc00?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:113 +0x4b
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/objstorage.(*fileBufferedWritable).Finish(0xc00ac5e840)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/objstorage/external/com_github_cockroachdb_pebble/objstorage/vfs_writable.go:44 +0x42
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/sstable.(*Writer).Close(0xc0047e7500)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:2003 +0x1673
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble.(*DB).runCompaction.func6({0xc007b5df00, 0x13, 0x9?})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2882 +0x2f6
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble.(*DB).runCompaction(0xc000fc0a00, 0xf, 0xc004818000)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3157 +0x227c
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble.(*DB).compact1(0xc000fc0a00, 0xc004818000, 0x0)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2514 +0x1a5
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble.(*DB).compact.func1({0x6f0d2a0, 0xc00b24fda0})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2485 +0xad
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !runtime/pprof.Do({0x6f0d230?, 0xc00007a088?}, {{0xc0002f5940?, 0x6f59f20?, 0xc006514000?}}, 0xc0014c1f88)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble.(*DB).compact(0xc34b00?, 0xc0067032f0?, 0xc0014c1fb8?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2482 +0x6b
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2203 +0x5ea

Did the WAL's Fdatasync successfully complete, but we somehow fataled anyways?

@jbowens
Collaborator

jbowens commented Mar 22, 2023

The only LogWriter stack is sitting waiting for more data to flush; not syncing:

F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !goroutine 114 [sync.Cond.Wait, 1 minutes]:
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !sync.runtime_notifyListWait(0xc0004d39e0, 0x0)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	GOROOT/src/runtime/sema.go:517 +0x14c
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !sync.(*Cond).Wait(0xc001439de8?)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	GOROOT/src/sync/cond.go:70 +0x8c
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/record.(*flusherCond).Wait(...)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:203
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(0xc0004d3900, {0x5156280, 0xb33f650})
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:428 +0x63f
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !runtime/pprof.Do({0x6f0d230?, 0xc00007a088?}, {{0xc0002f58a0?, 0xcc530a6dbb769b89?, 0x86ef09232023ad5b?}}, 0xc001439fc0)
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !github.com/cockroachdb/pebble/record.NewLogWriter.func2()
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:351 +0x5c
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !created by github.com/cockroachdb/pebble/record.NewLogWriter
F230322 12:35:49.545683 48379 storage/pebble.go:1232 ⋮ [T1,n1] 958 !	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:350 +0x456
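
The manual check above (which goroutines are inside Fdatasync, and on whose behalf) can be sketched as a small Go helper. This is a hypothetical aid, not part of roachtest or Pebble: syncingGoroutines and the abridged sampleDump are illustrative, the sample keeps only a couple of frames per goroutine with arguments elided, and it assumes the log-line prefixes have already been stripped so that goroutines are separated by blank lines.

```go
package main

import (
	"fmt"
	"strings"
)

// syncingGoroutines splits a plain-text goroutine dump into per-goroutine
// blocks and returns the header line of every goroutine whose stack contains
// both syscallFrame and ownerFrame.
func syncingGoroutines(dump, syscallFrame, ownerFrame string) []string {
	var headers []string
	for _, block := range strings.Split(dump, "\n\n") {
		block = strings.TrimSpace(block)
		if !strings.HasPrefix(block, "goroutine ") {
			continue
		}
		if strings.Contains(block, syscallFrame) && strings.Contains(block, ownerFrame) {
			header, _, _ := strings.Cut(block, "\n")
			headers = append(headers, header)
		}
	}
	return headers
}

// sampleDump is an abridged version of the two stacks quoted from the
// fatal's goroutine dump in this thread.
const sampleDump = `goroutine 48694 [syscall]:
golang.org/x/sys/unix.Fdatasync(...)
github.com/cockroachdb/pebble/sstable.(*Writer).Close(...)

goroutine 114 [sync.Cond.Wait, 1 minutes]:
sync.(*Cond).Wait(...)
github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(...)
`

func main() {
	// Goroutines syncing on behalf of an sstable writer.
	fmt.Println(syncingGoroutines(sampleDump, "unix.Fdatasync(", "sstable.(*Writer)"))
	// Goroutines syncing on behalf of a WAL LogWriter.
	fmt.Println(syncingGoroutines(sampleDump, "unix.Fdatasync(", "record.(*LogWriter)"))
}
```

On this sample the first query matches goroutine 48694 and the second matches nothing, mirroring the observation that the fatal's dump shows an sstable sync but no WAL sync.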

@jbowens
Collaborator

jbowens commented Mar 22, 2023

We have an automatic goroutine dump from 12:35:42, ~7 seconds before the fatal.

We see the same compaction sstable Fdatasync then:

goroutine 48694 [syscall]:
syscall.Syscall(0x43189561?, 0xc0050d5e70?, 0xc0050d5f08?, 0x11ca24a?)
	GOROOT/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Fdatasync(0x14702fe5ba?)
	golang.org/x/sys/unix/external/org_golang_x_sys/unix/zsyscall_linux.go:723 +0x30
github.com/cockroachdb/pebble/vfs.(*linuxFile).SyncData(0x14702fe5ba?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/default_linux_noarm.go:65 +0x1d
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData.func1()
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:232 +0x2e
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc00bae5450, 0x3, 0xc0050d5fa8)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:280 +0xde
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData(0x1000?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:231 +0x53
github.com/cockroachdb/pebble/vfs.(*enospcFile).SyncData(0xc0047e7db8?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:413 +0x22
github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync(0xc0099fbc00?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:113 +0x4b
github.com/cockroachdb/pebble/objstorage.(*fileBufferedWritable).Finish(0xc00ac5e840)
	github.com/cockroachdb/pebble/objstorage/external/com_github_cockroachdb_pebble/objstorage/vfs_writable.go:44 +0x42
github.com/cockroachdb/pebble/sstable.(*Writer).Close(0xc0047e7500)
	github.com/cockroachdb/pebble/sstable/external/com_github_cockroachdb_pebble/sstable/writer.go:2003 +0x1673
github.com/cockroachdb/pebble.(*DB).runCompaction.func6({0xc007b5df00, 0x13, 0x9?})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2882 +0x2f6
github.com/cockroachdb/pebble.(*DB).runCompaction(0xc000fc0a00, 0xf, 0xc004818000)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:3157 +0x227c
github.com/cockroachdb/pebble.(*DB).compact1(0xc000fc0a00, 0xc004818000, 0x0)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2514 +0x1a5
github.com/cockroachdb/pebble.(*DB).compact.func1({0x6f0d2a0, 0xc00b24fda0})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2485 +0xad
runtime/pprof.Do({0x6f0d230?, 0xc00007a088?}, {{0xc0002f5940?, 0x6f59f20?, 0xc006514000?}}, 0xc0014c1f88)
	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble.(*DB).compact(0xc34b00?, 0xc0067032f0?, 0xc0014c1fb8?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2482 +0x6b
created by github.com/cockroachdb/pebble.(*DB).maybeScheduleCompactionPicker
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/compaction.go:2203 +0x5ea

and a LogWriter goroutine stuck in SyncData:

goroutine 48380 [syscall]:
syscall.Syscall(0xc006891bc8?, 0x4f54f4?, 0x0?, 0xc006891bc8?)
	GOROOT/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Fdatasync(0x145873b73b?)
	golang.org/x/sys/unix/external/org_golang_x_sys/unix/zsyscall_linux.go:723 +0x30
github.com/cockroachdb/pebble/vfs.(*linuxFile).SyncData(0x145873b73b?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/default_linux_noarm.go:65 +0x1d
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData.func1()
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:232 +0x2e
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).timeDiskOp(0xc00d98c4b0, 0x3, 0xc006891c98)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:280 +0xde
github.com/cockroachdb/pebble/vfs.(*diskHealthCheckingFile).SyncData(0x0?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_health.go:231 +0x53
github.com/cockroachdb/pebble/vfs.(*enospcFile).SyncData(0x4f59f7?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/disk_full.go:413 +0x22
github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync(0xc008f36738?)
	github.com/cockroachdb/pebble/vfs/external/com_github_cockroachdb_pebble/vfs/syncing_file.go:113 +0x4b
github.com/cockroachdb/pebble/record.(*LogWriter).syncWithLatency(0xc005438500)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:545 +0x43
github.com/cockroachdb/pebble/record.(*LogWriter).flushPending(0xc005438500, {0xc00dcd7cc5, 0x8af, 0x4343}, {0xc00a7de480, 0x0, 0x500c26?}, 0x414, 0x410)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:532 +0x1dc
github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(0xc005438500, {0x5156280, 0xb33f650})
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:466 +0x358
runtime/pprof.Do({0x6f0d230?, 0xc00007a088?}, {{0xc0002f58a0?, 0xc006514000?, 0xc00128b2c0?}}, 0xc006891fc0)
	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
github.com/cockroachdb/pebble/record.NewLogWriter.func2()
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:351 +0x5c
created by github.com/cockroachdb/pebble/record.NewLogWriter
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:350 +0x456

It looks like the LogWriter's syncdata got unwedged by the time of the fatal's goroutine dump? Or the dump somehow missed it. Either way, the earlier goroutine dump seems to indicate this was a legitimate disk stall.

@jbowens jbowens closed this as not planned Mar 22, 2023
@jbowens jbowens added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2023
@exalate-issue-sync exalate-issue-sync bot added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. and removed X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue labels Mar 22, 2023
@jbowens jbowens added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2023
@jbowens
Collaborator

jbowens commented Mar 22, 2023

cc #97968 for tracking
