db: FormatPrePebblev1Marked panics if sstable does not exist #2019

Closed
renatolabs opened this issue Oct 20, 2022 · 2 comments · Fixed by #2020

Comments

@renatolabs

Summary

We've been observing a Pebble crash in CRDB's upgrade tests. Specifically, while the cluster is upgrading from the 22.1 release to the current version (either master or the 22.2.0 release branch), a crash is non-deterministically observed in Pebble during the RatchetFormatMajorVersion call. The actual panic happens inside markFilesPrePebblev1.

Stack trace:

goroutine 1076065 [running]:
os.Exit(0x1)
	GOROOT/src/os/proc.go:70 +0x75
github.com/cockroachdb/pebble.defaultLogger.Fatalf({}, {0x5521568?, 0x45c8c5?}, {0xc01eb55260?, 0x4a0e980?, 0xc004c97300?})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/logger.go:32 +0x49
github.com/cockroachdb/pebble/internal/base.MustExist({0x686d638, 0xc000782360}, {0xc00e011160, 0x1f}, {0x7f803d0c67d8, 0xa20ec38}, {0x67d0780, 0xc01c124498})
	github.com/cockroachdb/pebble/internal/base/external/com_github_cockroachdb_pebble/internal/base/filenames.go:167 +0x4fe
github.com/cockroachdb/pebble.(*tableCacheContainer).withReader(0xc000dc2c80, 0x25?, 0xc02411e918)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:178 +0x15f
github.com/cockroachdb/pebble.glob..func17.1(0xc0204d9d40)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/format_major_version.go:471 +0x265
github.com/cockroachdb/pebble.(*DB).markFilesLocked.func1(0x0?, 0xc00702f850?, 0xc02411ecc0?, 0xc02411eca7, 0xc02411ee00, 0xc02411ecd0)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/format_major_version.go:509 +0x11e
github.com/cockroachdb/pebble.(*DB).markFilesLocked(0xc002cf1680, 0x9?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/format_major_version.go:510 +0xb5
github.com/cockroachdb/pebble.glob..func14(0xc002cf1680)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/format_major_version.go:264 +0x3c
github.com/cockroachdb/pebble.(*DB).ratchetFormatMajorVersionLocked(0xc002cf1680, 0xa)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/format_major_version.go:351 +0x27c
github.com/cockroachdb/pebble.(*DB).RatchetFormatMajorVersion(0x9646f40?, 0xc000782360?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/format_major_version.go:329 +0xec
github.com/cockroachdb/cockroach/pkg/storage.(*Pebble).SetMinVersion(0xc000fb0160, {0x0?, 0x0?, 0x0?, 0x0?})
	github.com/cockroachdb/cockroach/pkg/storage/pebble.go:1816 +0x207
github.com/cockroachdb/cockroach/pkg/kv/kvserver.WriteClusterVersion({0x680b5c8, 0xc017acf770}, {0x68a7188, 0xc000fb0160}, {{0xf4256, 0x1, 0x0, 0x30}})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3725 +0x2c5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.WriteClusterVersionToEngines({0x680b5c8, 0xc017acf770}, {0xc003795860?, 0x1, 0xc003795860?}, {{0x1?, 0x0?, 0x0?, 0x0?}})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:337 +0xe5
github.com/cockroachdb/cockroach/pkg/server.bumpClusterVersion({0x680b5c8, 0xc017acf770}, 0xc001216000, {{0x129c5398?, 0xc0?, 0x1018b46?, 0x0?}}, {0xc003795860, 0x1, 0x1})
	github.com/cockroachdb/cockroach/pkg/server/migration.go:133 +0x198
github.com/cockroachdb/cockroach/pkg/server.(*migrationServer).BumpClusterVersion.func1({0x680b5c8?, 0xc017acf770?})
	github.com/cockroachdb/cockroach/pkg/server/migration.go:101 +0x12c
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc002caa990, {0x680b5c8, 0xc017acf770}, {0x14?, 0x0?}, 0xc0129c54c8)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 +0xd1
github.com/cockroachdb/cockroach/pkg/server.(*migrationServer).BumpClusterVersion(0xc006f20620, {0x680b5c8?, 0xc017acf6b0?}, 0xc0268f2c00)
	github.com/cockroachdb/cockroach/pkg/server/migration.go:96 +0x125
github.com/cockroachdb/cockroach/pkg/server/serverpb._Migration_BumpClusterVersion_Handler.func1({0x680b5c8, 0xc017acf6b0}, {0x5158640?, 0xc0268f2c00})
	github.com/cockroachdb/cockroach/pkg/server/serverpb/bazel-out/k8-fastbuild/bin/pkg/server/serverpb/serverpb_go_proto_/github.com/cockroachdb/cockroach/pkg/server/serverpb/migration.pb.go:593 +0x78
github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor.ServerInterceptor.func1({0x680b5c8, 0xc017acf650}, {0x5158640, 0xc0268f2c00}, 0xc007cdb2a0, 0xc00840a7e0)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:114 +0x66a
google.golang.org/grpc.chainUnaryInterceptors.func1.1({0x680b5c8?, 0xc017acf650?}, {0x5158640?, 0xc0268f2c00?})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1117 +0x5b
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3({0x680b5c8, 0xc017acf650}, {0x5158640, 0xc0268f2c00}, 0xc000ad3920?, 0xc01eb8c7c0)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:263 +0x83
google.golang.org/grpc.chainUnaryInterceptors.func1.1({0x680b5c8?, 0xc017acf650?}, {0x5158640?, 0xc0268f2c00?})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1120 +0x83
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1({0x680b5c8?, 0xc017acf650?})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:232 +0x39
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc002caa990, {0x680b5c8, 0xc017acf650}, {0x203001?, 0x40?}, 0xc000ad39e8)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 +0xd1
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1({0x680b5c8?, 0xc017acf650?}, {0x5158640?, 0xc0268f2c00?}, 0x5025f20?, 0x7f80677d1001?)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:230 +0x95
google.golang.org/grpc.chainUnaryInterceptors.func1.1({0x680b5c8?, 0xc017acf650?}, {0x5158640?, 0xc0268f2c00?})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1120 +0x83
google.golang.org/grpc.chainUnaryInterceptors.func1({0x680b5c8, 0xc017acf650}, {0x5158640, 0xc0268f2c00}, 0xc007cdb2a0, 0xc00840a7e0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1122 +0x12b
github.com/cockroachdb/cockroach/pkg/server/serverpb._Migration_BumpClusterVersion_Handler({0x50d3d20?, 0xc006f20620}, {0x680b5c8, 0xc017acf650}, 0xc025527c20, 0xc000475fa0)
	github.com/cockroachdb/cockroach/pkg/server/serverpb/bazel-out/k8-fastbuild/bin/pkg/server/serverpb/serverpb_go_proto_/github.com/cockroachdb/cockroach/pkg/server/serverpb/migration.pb.go:595 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc002b45340, {0x6854320, 0xc0112a2820}, 0xc02a70d680, 0xc006ef8ed0, 0x95a5338, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1283 +0xcfe
google.golang.org/grpc.(*Server).handleStream(0xc002b45340, {0x6854320, 0xc0112a2820}, 0xc02a70d680, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1620 +0xa2f
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:920 +0x28a

CRDB test failures:

Notes

  • This failure seems to happen only in "large" databases (GBs of disk usage, large relative to other tests). Other upgrade tests have not exposed this crash.
  • The failure is non-deterministic, happening with ~5% probability in the tpcc/mixed-headroom/n5cpu16 test as it currently exists on master.
  • Letting the background tpcc workload ramp up before attempting the cluster version upgrade increases the probability of failure (I don't have a good estimate of how much, but I did notice it happens more frequently).
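
For anyone trying to reproduce this, here is a rough sketch of how many test runs it might take to see the failure at least once, given the ~5% rate noted above. It assumes runs are independent, which may not hold in practice; `runsNeeded` is a hypothetical helper, not part of any test tooling.

```go
package main

import (
	"fmt"
	"math"
)

// runsNeeded returns the smallest n such that the chance of seeing at
// least one failure in n independent runs, 1-(1-p)^n, reaches the given
// confidence. p is the per-run failure probability.
func runsNeeded(p, confidence float64) int {
	return int(math.Ceil(math.Log(1-confidence) / math.Log(1-p)))
}

func main() {
	// ~5% per-run failure rate, 95% chance of at least one reproduction.
	fmt.Println(runsNeeded(0.05, 0.95)) // prints 59
}
```

Under these assumptions, roughly 59 runs give a 95% chance of hitting the crash at least once.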
@jbowens changed the title from "'file not found' crash while upgrading from crdb 22.1 to 22.2" to "db: FormatPrePebblev1Marked panics if sstable does not exist" on Oct 20, 2022
jbowens added a commit to jbowens/pebble that referenced this issue Oct 20, 2022
Fix the FormatPrePebblev1Marked migration to tolerate concurrent file
deletions by disabling physical deletion of files removed from the LSM
until the migration completes.

Fix cockroachdb#2019.
Informs cockroachdb/cockroach#89755.
Informs cockroachdb/cockroach#83079.
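
The idea in the commit message above can be sketched in a few lines of Go. This is only an illustration of deferring physical deletion while a migration scans files; it is not Pebble's implementation, and `fileStore`, `MarkObsolete`, and `migrate` are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// fileStore is a hypothetical stand-in for the sstables on disk.
type fileStore struct {
	mu           sync.Mutex
	files        map[uint64]bool // files physically present
	pending      []uint64        // obsolete files awaiting deletion
	disableCount int             // >0 means physical deletion is deferred
}

func newFileStore(nums ...uint64) *fileStore {
	s := &fileStore{files: make(map[uint64]bool)}
	for _, n := range nums {
		s.files[n] = true
	}
	return s
}

// DisableDeletions defers physical deletions until EnableDeletions.
func (s *fileStore) DisableDeletions() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.disableCount++
}

// EnableDeletions re-enables deletion and removes any deferred files.
func (s *fileStore) EnableDeletions() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.disableCount--
	if s.disableCount == 0 {
		for _, n := range s.pending {
			delete(s.files, n)
		}
		s.pending = nil
	}
}

// MarkObsolete models a compaction removing a file from the LSM.
func (s *fileStore) MarkObsolete(num uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.disableCount > 0 {
		s.pending = append(s.pending, num) // defer the physical delete
		return
	}
	delete(s.files, num)
}

func (s *fileStore) Exists(num uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.files[num]
}

// migrate opens every file in a previously captured snapshot. The
// concurrent callback simulates a compaction racing with the scan.
func migrate(s *fileStore, snapshot []uint64, concurrent func()) error {
	s.DisableDeletions()
	defer s.EnableDeletions()
	for i, num := range snapshot {
		if i == 0 && concurrent != nil {
			concurrent()
		}
		if !s.Exists(num) {
			return fmt.Errorf("file %06d does not exist", num)
		}
	}
	return nil
}

func main() {
	s := newFileStore(1, 2, 3)
	err := migrate(s, []uint64{1, 2, 3}, func() { s.MarkObsolete(2) })
	fmt.Println(err, s.Exists(2))
}
```

In this sketch, the file made obsolete mid-scan stays readable until the migration returns and is physically removed afterwards. Without the `DisableDeletions` call, the same race would surface as a "does not exist" error.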
jbowens added a commit that referenced this issue Oct 20, 2022
Fix the FormatPrePebblev1Marked migration to tolerate concurrent file
deletions by disabling physical deletion of files removed from the LSM
until the migration completes.

Fix #2019.
Informs cockroachdb/cockroach#89755.
Informs cockroachdb/cockroach#83079.
@joshimhoff
Contributor

joshimhoff commented Oct 21, 2022

Conversation at https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1666329387677329?thread_ts=1664295784.890119&cid=C01CDD4HRC5 about how there was nothing in the logs indicating the crash reason, which required @renatolabs to do this:

To get the stack trace, I compiled cockroach with a patched version of the stdlib that dumps a stack trace on os.Exit,

We will open some follow-up issues soon. It seems there are naked calls to os.Exit in the pebble repo, which may leave no crash reason in the logs because the process exits before the logs are flushed, or something like that. Also, the linter we have set up on the CRDB code to protect against bugs of this kind doesn't run on the pebble repo.
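
For context on the general hazard (not a confirmed diagnosis of what Pebble or CRDB does here): in Go, os.Exit terminates the process immediately and does not run deferred functions, so anything still sitting in a buffered writer is lost. A minimal sketch of a fatal logger that flushes before exiting could look like this; `flushingLogger` and `memSink` are hypothetical names, and the exit function is injected so the example can run without killing the process.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// memSink is a tiny in-memory io.Writer used for demonstration.
type memSink struct{ data []byte }

func (m *memSink) Write(p []byte) (int, error) {
	m.data = append(m.data, p...)
	return len(p), nil
}

// flushingLogger buffers output and flushes it before exiting.
type flushingLogger struct {
	w    *bufio.Writer
	exit func(int) // os.Exit in production; replaceable in tests
}

func newFlushingLogger(out io.Writer, exit func(int)) *flushingLogger {
	return &flushingLogger{w: bufio.NewWriter(out), exit: exit}
}

// Fatalf writes the message, flushes the buffer, and only then exits.
func (l *flushingLogger) Fatalf(format string, args ...interface{}) {
	fmt.Fprint(l.w, "FATAL: ")
	fmt.Fprintf(l.w, format, args...)
	fmt.Fprintln(l.w)
	l.w.Flush() // without this, the message could be lost on exit
	l.exit(1)
}

func main() {
	sink := &memSink{}
	code := -1
	l := newFlushingLogger(sink, func(c int) { code = c })
	l.Fatalf("file %s does not exist", "000123.sst")
	fmt.Printf("exit code %d, logged: %q\n", code, string(sink.data))
}
```

A real program would pass os.Exit as the exit function; the injected version here only records the exit code so the flushed output can be inspected.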

@joshimhoff
Contributor

The details given above are not exactly right, but there is certainly a bug of some kind, given the lack of a crash reason in the logs. See #2039 for the current understanding.
