roachtest: version/mixed/nodes=3 failed #38560
Labels
C-test-failure
Broken test (automatically or manually discovered).
O-roachtest
O-robot
Originated from a bot.
cockroach-teamcity added the C-test-failure, O-roachtest, and O-robot labels on Jun 28, 2019
Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue on Jul 3, 2019
Fixes cockroachdb#34180. Fixes cockroachdb#35493. Fixes cockroachdb#36983. Fixes cockroachdb#37108. Fixes cockroachdb#37371. Fixes cockroachdb#37384. Fixes cockroachdb#37551. Fixes cockroachdb#37879. Fixes cockroachdb#38095. Fixes cockroachdb#38131. Fixes cockroachdb#38136. Fixes cockroachdb#38549. Fixes cockroachdb#38552. Fixes cockroachdb#38555. Fixes cockroachdb#38560. Fixes cockroachdb#38562. Fixes cockroachdb#38563. Fixes cockroachdb#38569. Fixes cockroachdb#38578. Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1 2.00 metrics requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led me to find the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued.

Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:

```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```

Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Release note: None
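The leak described in this commit message can be illustrated with a minimal Go sketch. It is not the CockroachDB code: `quotaPool`, `propose`, the `releaseOnErr` switch, and the pool size of 100 are all hypothetical names and values chosen for illustration. It only shows why quota acquired before a proposal has to be handed back when the proposal fails.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// quotaPool is a hypothetical, simplified stand-in for a per-range
// proposal quota pool. It is not the real CockroachDB type.
type quotaPool struct {
	mu        sync.Mutex
	available int64
}

// tryAcquire takes n bytes of quota if that much is available.
func (p *quotaPool) tryAcquire(n int64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n > p.available {
		return false
	}
	p.available -= n
	return true
}

// release returns n bytes of quota to the pool.
func (p *quotaPool) release(n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.available += n
}

// avail reports the quota currently available.
func (p *quotaPool) avail() int64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.available
}

// errInvalidTrigger stands in for the ChangeReplicasTrigger error above.
var errInvalidTrigger = errors.New("received invalid ChangeReplicasTrigger")

// propose acquires quota and then submits the command. With releaseOnErr
// set to false it mimics the bug: a failed submit keeps the quota forever.
// On success the quota stays held; in this sketch nothing releases it.
func propose(p *quotaPool, size int64, submit func() error, releaseOnErr bool) error {
	if !p.tryAcquire(size) {
		return errors.New("no quota available")
	}
	if err := submit(); err != nil {
		if releaseOnErr {
			p.release(size) // the release that was lost in the rewrite
		}
		return err
	}
	return nil
}

func main() {
	failing := func() error { return errInvalidTrigger }

	leaky := &quotaPool{available: 100}
	_ = propose(leaky, 10, failing, false)
	fmt.Println("available without release on error:", leaky.avail())

	fixed := &quotaPool{available: 100}
	_ = propose(fixed, 10, failing, true)
	fmt.Println("available with release on error:", fixed.avail())
}
```

In the leaky variant the pool never returns to its full size after the failed proposal, which matches the debug page observation of no pending commands but missing quota.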
craig bot pushed a commit that referenced this issue on Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180. Fixes #35493. Fixes #36983. Fixes #37108. Fixes #37371. Fixes #37384. Fixes #37551. Fixes #37879. Fixes #38095. Fixes #38131. Fixes #38136. Fixes #38549. Fixes #38552. Fixes #38555. Fixes #38560. Fixes #38562. Fixes #38563. Fixes #38569. Fixes #38578. Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1 2.00 metrics requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led me to find the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued.

Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:

```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```

Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
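To make the "single leaked byte" claim concrete, here is a small Go sketch of the arithmetic. It assumes, as a simplification, that a request larger than the pool is clamped to the pool's full capacity; the `pool` type, the `needed` and `canAcquire` helpers, and the sizes (1 MiB pool, 32 MiB request) are hypothetical and not taken from the CockroachDB source.

```go
package main

import "fmt"

// pool is a hypothetical model of a proposal quota pool: capacity is the
// size when nothing is outstanding, available is what is free right now.
type pool struct {
	capacity  int64
	available int64
}

// needed clamps a request to the pool capacity, modelling an oversized
// request asking for "all of the available quota" (an assumption here).
func (p pool) needed(n int64) int64 {
	if n > p.capacity {
		return p.capacity
	}
	return n
}

// canAcquire reports whether a request of n bytes could proceed now.
func (p pool) canAcquire(n int64) bool {
	return p.needed(n) <= p.available
}

func main() {
	const capacity = 1 << 20 // 1 MiB pool, arbitrary for illustration
	const sstSize = 32 << 20 // oversized request, arbitrary for illustration

	healthy := pool{capacity: capacity, available: capacity}
	leaked := pool{capacity: capacity, available: capacity - 1} // one byte leaked

	fmt.Println("healthy pool, oversized request:", healthy.canAcquire(sstSize))
	fmt.Println("leaked pool, oversized request:", leaked.canAcquire(sstSize))
	fmt.Println("leaked pool, small request:", leaked.canAcquire(512))
}
```

Under this model small requests keep flowing after a leak, which would explain why the stall only shows up once an oversized request arrives.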
SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog