
roachtest: decommission/mixed-versions failed #58523

Closed
cockroach-teamcity opened this issue Jan 7, 2021 · 7 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

(roachtest).decommission/mixed-versions failed on master@15765c0fa9118885dda0bd2ad1384b8801c412c3:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/mixed-versions/run_1
	versionupgrade.go:281,versionupgrade.go:386,retry.go:197,versionupgrade.go:385,versionupgrade.go:189,mixed_version_decommission.go:96,decommission.go:73,test_runner.go:760: pq: operation "show cluster setting version" timed out after 2m0s: value differs between local setting ([18 8 8 20 16 2 24 0 32 14]) and KV ([18 8 8 20 16 2 24 0 32 0]); try again later (<nil> after 1m58.079161845s)

	cluster.go:1637,context.go:140,cluster.go:1626,test_runner.go:841: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2564046-1610002658-09-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1
		1: 5596
		3: 5159
		4: 4908
		2: dead
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /decommission/mixed-versions

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 7, 2021
@cockroach-teamcity
Member Author

(roachtest).decommission/mixed-versions failed on master@dbc7245c5d8c9f009072353fec261419e573032c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/mixed-versions/run_1
	versionupgrade.go:281,versionupgrade.go:386,retry.go:197,versionupgrade.go:385,versionupgrade.go:189,mixed_version_decommission.go:96,decommission.go:73,test_runner.go:760: pq: operation "show cluster setting version" timed out after 2m0s: value differs between local setting ([18 8 8 20 16 2 24 0 32 14]) and KV ([18 8 8 20 16 2 24 0 32 0]); try again later (<nil> after 1m58.285677153s)

	cluster.go:1637,context.go:140,cluster.go:1626,test_runner.go:841: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2569739-1610175644-12-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1
		2: 5550
		1: 5829
		4: 5423
		3: dead
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


@aayushshah15
Contributor

@irfansharif it looks like this might be connected to #56480 (or some of its preceding changes), though the timing of these failures doesn't entirely line up with when that change landed. I ran a couple of iterations of this on master but could not reproduce. I'm now running a few iterations on that change's SHA and will post here again. Let me know if you already know something about this "value differs between local setting and KV" failure mode.
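As an aside on that failure mode: the two byte arrays in the error are decodable by hand, assuming (this is an assumption, not confirmed in the error itself) they are the usual proto encoding of the cluster version, with an outer tag+length byte pair wrapping single-byte varint fields major/minor/patch/internal. A throwaway decoder, not the real proto code:

```go
package main

import "fmt"

// decodeVersion parses the byte slices from the error message above,
// assuming they wrap a nested message whose fields (major, minor,
// patch, internal) are each a single-byte varint. Illustrative only.
func decodeVersion(b []byte) (major, minor, patch, internal byte) {
	// Skip the outer tag (0x12) and length byte wrapping the nested message.
	inner := b[2:]
	for i := 0; i+1 < len(inner); i += 2 {
		field := inner[i] >> 3 // tag byte: (field_number << 3) | wire_type
		val := inner[i+1]
		switch field {
		case 1:
			major = val
		case 2:
			minor = val
		case 3:
			patch = val
		case 4:
			internal = val
		}
	}
	return
}

func main() {
	local := []byte{18, 8, 8, 20, 16, 2, 24, 0, 32, 14}
	kv := []byte{18, 8, 8, 20, 16, 2, 24, 0, 32, 0}
	ma, mi, _, in := decodeVersion(local)
	fmt.Printf("local: %d.%d-%d\n", ma, mi, in) // local: 20.2-14
	ma, mi, _, in = decodeVersion(kv)
	fmt.Printf("KV:    %d.%d-%d\n", ma, mi, in) // KV:    20.2-0
}
```

If that reading is right, the local setting sits at an internal (upgrade-in-progress) version 20.2-14 while the KV copy is still at the base 20.2-0, which is what the "try again later" retry loop then times out waiting to converge.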

@tbg
Member

tbg commented Jan 12, 2021

I210109 07:04:39.282483 1 util/log/flags.go:194  stderr capture started
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1d68c92]

goroutine 806 [running]:
panic(0x424ff60, 0x76ca030)
	/usr/local/go/src/runtime/panic.go:1064 +0x545 fp=0xc00240d4b8 sp=0xc00240d3f0 pc=0x4e1b25
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:212
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:742 +0x413 fp=0xc00240d4e8 sp=0xc00240d4b8 pc=0x4f8873
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Version(0xc0011d8000, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica.go:805 +0x72 fp=0xc00240d528 sp=0xc00240d4e8 pc=0x1d68c92
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).PurgeOutdatedReplicas.func1(0xc0011d8000, 0xc00203c301)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2814 +0x8c fp=0xc00240d5c8 sp=0xc00240d528 pc=0x1e34c0c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit(0xc00203c330, 0xc00240d688)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:396 +0x151 fp=0xc00240d630 sp=0xc00240d5c8 pc=0x1de9351
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).VisitReplicas(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2013
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).PurgeOutdatedReplicas(0xc000369c00, 0x55d51e0, 0xc00203c2a0, 0x200000014, 0xe00000000, 0xc0004fec68, 0xc000d45a40)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2813 +0x1bf fp=0xc00240d6e8 sp=0xc00240d630 pc=0x1df53ff
github.com/cockroachdb/cockroach/pkg/server.(*migrationServer).PurgeOutdatedReplicas.func1(0xc000369c00, 0x4bad45, 0xc00240d788)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/migration.go:184 +0x65 fp=0xc00240d730 sp=0xc00240d6e8 pc=0x3938c25
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x3, 0xc000369c00, 0xc00240d788)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:131 +0x38 fp=0xc00240d760 sp=0xc00240d730 pc=0x1e38718
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc000bc4570, 0xc00240d818)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:352 +0x130 fp=0xc00240d7f8 sp=0xc00240d760 pc=0x731bf0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc000bc4540, 0xc00240d888, 0x48f67b0, 0x17)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:130 +0x75 fp=0xc00240d840 sp=0xc00240d7f8 pc=0x1e16c75
github.com/cockroachdb/cockroach/pkg/server.(*migrationServer).PurgeOutdatedReplicas(0xc0004ed6c0, 0x55d51e0, 0xc00203c1e0, 0xc001d60308, 0x0, 0x0, 0x0)

@tbg
Member

tbg commented Jan 12, 2021

Thanks for looking, Irfan.

@irfansharif
Contributor

irfansharif commented Jan 14, 2021

I've been staring at it this morning, and I'm pretty sure it's another manifestation of #58378. In #58378 (comment) we observed a scary lack of synchronization between how we set the ReplicaState for a given replica and how we mark a replica as "initialized", which happens between the following two points (note the locking, or the lack thereof, between these points):

r.setDescRaftMuLocked(ctx, s.Desc)

// Update the rest of the Raft state. Changes to r.mu.state.Desc must be
// managed by r.setDescRaftMuLocked and changes to r.mu.state.Lease must be handled
// by r.leasePostApply, but we called those above, so now it's safe to
// wholesale replace r.mu.state.
r.mu.state = s

What this means is that very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. In the panic above, we're running into a nil pointer here:

// Version returns the replica version.
func (r *Replica) Version() roachpb.Version {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return *r.mu.state.Version
}

Which is being driven by the migrations infrastructure here:

s.VisitReplicas(func(repl *Replica) (wantMore bool) {
if !repl.Version().Less(version) {

Looking at our replica iterator, we filter out uninitialized replicas here:

if initialized && destroyed.IsAlive() && !visitor(repl) {

But because it's still possible for an "initialized" replica to exist in our store map without a real ReplicaState, we can still run into the panic above. Seeing as @tbg is already planning to tackle #58378, I'll punt this over to him.
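The window described above can be sketched with a toy shape (hypothetical names; this is not the actual CockroachDB code): a replica that is already visible to the visitor but whose version pointer hasn't been installed yet, plus the kind of defensive accessor a stopgap would use instead of blindly dereferencing:

```go
package main

import (
	"fmt"
	"sync"
)

// Version is a stand-in for roachpb.Version.
type Version struct{ Major, Minor int32 }

// replica mimics the relevant shape: a mutex-protected state whose
// version pointer may still be nil even though the replica is already
// visible (and marked "initialized") in the store's replica map.
type replica struct {
	mu struct {
		sync.RWMutex
		version *Version // nil until the full ReplicaState is installed
	}
}

// versionOrZero is the defensive variant of Version(): rather than
// dereferencing unconditionally (which is what panicked in the trace
// above), it reports whether the state has been populated yet.
func (r *replica) versionOrZero() (Version, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.mu.version == nil {
		return Version{}, false
	}
	return *r.mu.version, true
}

func main() {
	r := &replica{} // visible in the map, state not yet installed
	if _, ok := r.versionOrZero(); !ok {
		fmt.Println("state not ready; skip this replica for now")
	}
	r.mu.Lock()
	r.mu.version = &Version{Major: 20, Minor: 2}
	r.mu.Unlock()
	v, ok := r.versionOrZero()
	fmt.Println(ok, v.Major, v.Minor) // true 20 2
}
```

A visitor like PurgeOutdatedReplicas could then skip (or retry) replicas in that window instead of crashing; whether the actual stopgap takes this shape or closes the window at the initialization site is a design choice for the real fix.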

@irfansharif irfansharif assigned tbg and unassigned irfansharif Jan 14, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 20, 2021
There's a scary lack of synchronization around how we set the
ReplicaState for a given replica, and how we mark a replica as
"initialized". What this means is that very temporarily, it's possible
for the entry in Store.mu.replicas to be both "initialized" and have an
empty ReplicaState. This was an existing problem, but is now more likely
to bite us given that the migrations infrastructure attempts to purge
outdated replicas at startup time (when replicas are being initialized,
and we're iterating through extant replicas in the Store.mu.replicas
map).

This issue has caused a bit of recent instability: cockroachdb#59180, cockroachdb#58489,
cockroachdb#58523, and cockroachdb#58378. While we work on a more considered fix to the
problem (tracked in cockroachdb#58489), we can introduce a stopgap to stop the
bleeding in the interim (and unskip some tests).

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 20, 2021
(same commit message as above)
craig bot pushed a commit that referenced this issue Jan 20, 2021
59194: kv: introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

(same commit message as above)

Release note: None

59201:  sql: add telemetry for materialized views and set schema. r=otan a=RichardJCai


Release note: None

Resolves #57299 

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: richardjcai <caioftherichard@gmail.com>
@tbg
Member

tbg commented Jan 21, 2021

This is not a release blocker if you have #59194.

pbardea pushed a commit to pbardea/cockroach that referenced this issue Jan 21, 2021
(same commit message as above)
@asubiotto
Contributor

At this stage, I think the alpha will definitely include #59194, so removing the release blocker label.

@asubiotto asubiotto removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jan 25, 2021
@tbg tbg closed this as completed Apr 20, 2021