circleci: failed tests: TestStoreMetrics #7678
tbg added a commit to tbg/cockroach that referenced this issue on Jul 13, 2016
This was a tough one. Several problems were addressed, all variations on the same theme:

- DistSenders in multiTestContext use a shared global stopper, but they may be called on goroutines which belong to a Store-level task. If that Store wants to quiesce and the DistSender can't finish its task because that same Store is already in quiescing mode, deadlocks occurred. The unfortunate solution is plugging in a channel which draws from two Stoppers, one of which may be quiesced and replaced multiple times.
- Additional deadlocks were caused due to multiTestContext's transport, which acquired a read lock that was formerly held in write mode throughout mtc.stopStore() (circumvented by dropping the lock there while quiescing).
- verifyStats was stopping individual Stores to perform computations without moving parts. Stopping individual Stores is tough when their tasks may be stuck on other Stores but can't complete while their own Store is already quiescing. Instead, verifyStats stops *all stores* simultaneously, regardless of which Store is actively being investigated.

Prior to these changes, failed in a few hundred to a few thousand iters (depending on how many of the above were partially addressed):

```
$ make stressrace PKG=./storage TESTS=TestStoreMetrics TESTTIMEOUT=10s STRESSFLAGS='-maxfails 1 -stderr -p 128 -timeout 15m'
15784 runs so far, 0 failures, over 8m0s
```

Fixes cockroachdb#7678.
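The second deadlock described in the commit message (a transport that needs a read lock while `stopStore` holds the write lock and waits for tasks to drain) can be illustrated with a minimal sketch. This is not the actual multiTestContext code: `testCluster`, `send`, `startTask`, and the single shared `WaitGroup` are hypothetical names and simplifications chosen for illustration. The point it tries to show is only the ordering: release the write lock first, then wait for in-flight tasks.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// testCluster is a hypothetical stand-in for a multi-store test harness.
// It is an assumption made for this sketch, not CockroachDB's real type.
type testCluster struct {
	mu      sync.RWMutex
	stopped map[int]bool
	tasks   sync.WaitGroup // simplification: one group for all stores
}

func newTestCluster() *testCluster {
	return &testCluster{stopped: make(map[int]bool)}
}

// send models a transport call: it takes the read lock to check
// whether the target store is still running.
func (c *testCluster) send(storeID int) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return !c.stopped[storeID]
}

// startTask launches a tracked task that keeps sending to storeID
// until the store reports that it has stopped.
func (c *testCluster) startTask(storeID int) {
	c.tasks.Add(1)
	go func() {
		defer c.tasks.Done()
		for c.send(storeID) {
			runtime.Gosched() // yield so the stopping goroutine can run
		}
	}()
}

// stopStore marks the store as stopped under the write lock, then
// drops the lock before waiting for tasks. If the write lock were
// held across Wait, every task blocked in send (on RLock) could never
// finish, and Wait would never return.
func (c *testCluster) stopStore(storeID int) {
	c.mu.Lock()
	c.stopped[storeID] = true
	c.mu.Unlock()

	c.tasks.Wait()
}

func main() {
	c := newTestCluster()
	c.startTask(1)
	c.stopStore(1)
	fmt.Println("store 1 accepts sends:", c.send(1)) // store 1 accepts sends: false
	fmt.Println("store 2 accepts sends:", c.send(2)) // store 2 accepts sends: true
}
```

Moving `c.tasks.Wait()` above `c.mu.Unlock()` in `stopStore` would reproduce the hang in this sketch: the task would block on `RLock` while the stopping goroutine waits for it to finish.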
tbg added a commit to tbg/cockroach that referenced this issue on Jul 14, 2016
The following test appears to have failed:
#19981:
Please assign, take a look and update the issue accordingly.