
roachtest: restore2TB/nodes=10 failed #60479

Closed
cockroach-teamcity opened this issue Feb 11, 2021 · 3 comments · Fixed by #60504
Labels: branch-master (Failures and bugs on the master branch), C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot), release-blocker (Indicates a release-blocker; use with a branch-release-2x.x label to denote which branch is blocked)

Comments

@cockroach-teamcity (Member)

(roachtest).restore2TB/nodes=10 failed on master@57de93a997db31d97cd7e23fb866fe0e44f90de2:

		  | rax    0xca
		  | rbx    0xc00049c000
		  | rcx    0x4b9fc3
		  | rdx    0x0
		  | rdi    0xc00049c148
		  | rsi    0x80
		  | rbp    0x7f710a5fbc68
		  | rsp    0x7f710a5fbc20
		  | r8     0x0
		  | r9     0x0
		  | r10    0x0
		  | r11    0x286
		  | r12    0x3
		  | r13    0xc000582600
		  | r14    0x0
		  | r15    0x0
		  | rip    0x4b9fc1
		  | rflags 0x286
		  | cs     0x33
		  | fs     0x0
		  | gs     0x0
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | signal: killed
		  | (1) signal: killed
		  | Error types: (1) *exec.ExitError
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	cluster.go:2687,restore.go:264,test_runner.go:767: monitor failure: monitor task failed: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2683
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:264
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2731
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) monitor task failed
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString


Artifacts: /restore2TB/nodes=10

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Feb 11, 2021
@dt (Member) commented Feb 11, 2021

Per @nvanbenschoten, this looks like a deadlock in KV, potentially related to #59423 or #59086. Additional context in Slack: https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1613076549170200?thread_ts=1613072464.164800&cid=C0KB9Q03D

@nvanbenschoten (Member)

Concretely, the issue here is ad78116, which introduced a lock-ordering bug. The locking notes indicate that the store mutex should be locked before the replica mutex, but that commit does the opposite.

We see the resulting deadlock with these two goroutines:

1: semacquire [115 minutes] [Created by stop.(*Stopper).RunAsyncTask @ stopper.go:347]
    sync            sema.go:71                                runtime_SemacquireMutex(*uint32(#1078), bool(#245))
    sync            mutex.go:138                              (*Mutex).lockSlow(*Mutex(#1077))
    sync            mutex.go:81                               (*Mutex).Lock(...)
    sync            rwmutex.go:98                             (*RWMutex).Lock(*RWMutex(#1077))
    kvserver        replica_raftstorage.go:973                (*Replica).applySnapshot(*Replica(#5940), Context(#199), IncomingSnapshot(0x8047336de5c0238f), Snapshot(#6688), HardState(0x0), []*Replica(#6502 len=0 cap=0), ...)
    kvserver        replica_raft.go:544                       (*Replica).handleRaftReadyRaftMuLocked(*Replica(#5940), Context(#199), IncomingSnapshot(0x8047336de5c0238f), #6688, 0x0, 0x0, 0x0, #6502, ...)
    kvserver        store_raft.go:358                         (*Store).processRaftSnapshotRequest.func1(*Store(#199), Context(#6906), *SnapshotRequest_Header(0x0))
    kvserver        store_raft.go:219                         (*Store).withReplicaForRequest(*Store(#1075), Context(#199), *RaftMessageRequest(#6501), func(#6131))
    kvserver        store_raft.go:280                         (*Store).processRaftSnapshotRequest(*Store(#1075), Context(#199), *SnapshotRequest_Header(#6500), IncomingSnapshot(0x8047336de5c0238f), #6688, 0x0, 0x0, 0x0, ...)
    kvserver        store_snapshot.go:812                     (*Store).receiveSnapshot(*Store(#1075), Context(#199), *SnapshotRequest_Header(#6500), incomingSnapshotStream(#11155), 0x0, 0x0)
    kvserver        store_raft.go:83                          (*Store).HandleSnapshot.func1(*Store(#199), *SnapshotRequest_Header(#6906), SnapshotResponseStream(#301))
    stop            stopper.go:331                            (*Stopper).RunTaskWithErr(*Stopper(#545), Context(#199), string(#144, len=30), func(#4185), 0x0)
    kvserver        store_raft.go:73                          (*Store).HandleSnapshot(*Store(#1075), *SnapshotRequest_Header(#6500), SnapshotResponseStream(#11154), #6090, #11195)
    kvserver        raft_transport.go:412                     (*RaftTransport).RaftSnapshot.func1.1(*RaftTransport(#213), MultiRaft_RaftSnapshotServer(#6090), #199, #6389, #6765, 0x0)
    kvserver        raft_transport.go:413                     (*RaftTransport).RaftSnapshot.func1(*RaftTransport(#199), MultiRaft_RaftSnapshotServer(#6389))
    stop            stopper.go:352                            (*Stopper).RunAsyncTask.func1(#545, #199, #6389, 0, #6905)

5: semacquire [115 minutes] [Created by stop.(*Stopper).RunAsyncTask @ stopper.go:347]
    sync            sema.go:71                                runtime_SemacquireMutex(#5943, #74, 0)
    sync            rwmutex.go:50                             (*RWMutex).RLock(...)
    kvserver        replica.go:715                            (*Replica).Desc(#5940, 0)
    kvserver        store_replica_btree.go:71                 (*storeReplicaBTree).LookupReplica(#1806, #197, #533, *, *, 8, 0)
    kvserver        store.go:2178                             (*Store).LookupReplica(#1075, *, *, 8, 0)
    kvserver        split_trigger_helper.go:36                (*replicaMsgAppDropper).ShouldDrop(*, *, *, 8, 0, *, *)
    kvserver        split_trigger_helper.go:133               maybeDropMsgApp(#199, *, #185, *, *, *, *, 8, *)
    kvserver        store_raft.go:259                         (*Store).processRaftRequestWithReplica(#1075, #199, *, *, *, #71)
    kvserver        store_raft.go:476                         (*Store).processRequestQueue.func1(#199, #1952, *, *)
    kvserver        store_raft.go:219                         (*Store).withReplicaForRequest(#1075, #199, #1952, *, *, 0)
    kvserver        store_raft.go:473                         (*Store).processRequestQueue(#1075, #199, #1952, *, 0)
    kvserver        scheduler.go:272                          (*raftScheduler).worker(#2244, #199, #1952)
    stop            stopper.go:352                            (*Stopper).RunAsyncTask.func1(#545, #199, #1952, #836, *)

The first goroutine locks:

```go
r.mu.Lock()
r.store.mu.Lock()
```

The second goroutine locks:

```go
s.mu.RLock()
r.mu.RLock()
```

I'm drafting a fix now.
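
For illustration, here is a minimal, self-contained Go sketch of the inverted ordering described above. The `Store` and `Replica` types below are hypothetical stand-ins, not the real kvserver types; running it trips Go's built-in deadlock detector.

```go
package main

import (
	"sync"
	"time"
)

// Hypothetical stand-ins for the kvserver types; not CockroachDB code.
type Store struct{ mu sync.RWMutex }

type Replica struct {
	mu    sync.RWMutex
	store *Store
}

func main() {
	s := &Store{}
	r := &Replica{store: s}

	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine 1: replica mutex first, then store mutex
	// (the inverted ordering).
	go func() {
		defer wg.Done()
		r.mu.Lock()
		time.Sleep(100 * time.Millisecond) // widen the race window
		r.store.mu.Lock()                  // blocks: goroutine 2 holds s.mu
		r.store.mu.Unlock()
		r.mu.Unlock()
	}()

	// Goroutine 2: store mutex first, then replica mutex
	// (the documented ordering, as in LookupReplica -> Replica.Desc).
	go func() {
		defer wg.Done()
		s.mu.RLock()
		time.Sleep(100 * time.Millisecond)
		r.mu.RLock() // blocks: goroutine 1 holds r.mu; deadlock
		r.mu.RUnlock()
		s.mu.RUnlock()
	}()

	// Never returns; the runtime reports
	// "fatal error: all goroutines are asleep - deadlock!".
	wg.Wait()
}
```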

@cockroach-teamcity (Member, Author)

(roachtest).restore2TB/nodes=10 failed on master@9969862335cf501c1aa67a4aaf018ad5dc8e85e8:

		  | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  | 				WITH into_db = 'restore2tb'" returned
		  | stderr:
		  | ERROR: splitting key /Table/53/1/27992857: invalid lease
		  | Failed running "sql"
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 1. Command with error:
		  |   | ```
		  |   | ./cockroach sql --insecure -e "
		  |   | 				RESTORE csv.bank FROM
		  |   | 				'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=10240,ranges=0,rows=65104166,seed=1/bank'
		  |   | 				WITH into_db = 'restore2tb'"
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (4) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *exec.ExitError

	cluster.go:2687,restore.go:264,test_runner.go:767: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2683
		  | main.registerRestore.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/restore.go:264
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2731
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2645
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:5652
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:191
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError


Artifacts: /restore2TB/nodes=10

See this test on roachdash
powered by pkg/cmd/internal/issues

craig bot pushed a commit that referenced this issue Feb 16, 2021
60504: kv: fix lock ordering in Replica.applySnapshot r=tbg a=nvanbenschoten

Fixes #60479.

ad78116 was a nice improvement that avoided a number of tricky questions about exposing an inconsistent in-memory replica state during a snapshot. However, it appears to have introduced a deadlock due to lock ordering issues (see referenced issue).

This commit fixes that issue by locking the Store mutex before the Replica mutex in `Replica.applySnapshot`'s combined critical section.
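
Sketched against hypothetical types (this is the shape of the change, not the actual PR diff), the reordered critical section looks roughly like this:

```go
// Sketch only: the reordering described above, using hypothetical
// types rather than the real kvserver code.
package kvsketch

import "sync"

type Store struct{ mu sync.RWMutex }

type Replica struct {
	mu    sync.RWMutex
	store *Store
}

// applySnapshotCriticalSection illustrates the fixed ordering: the store
// mutex is acquired before the replica mutex, matching the documented
// ordering that readers such as Store.LookupReplica already follow.
func (r *Replica) applySnapshotCriticalSection(update func()) {
	r.store.mu.Lock() // store mutex first (the fix)
	defer r.store.mu.Unlock()
	r.mu.Lock() // replica mutex second
	defer r.mu.Unlock()
	update() // mutate store and replica state under both locks
}
```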

Release note: None


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot closed this as completed in 2e71a2b on Feb 16, 2021