-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: decommission/mixed-versions failed #58523
Comments
(roachtest).decommission/mixed-versions failed on master@dbc7245c5d8c9f009072353fec261419e573032c:
More
Artifacts: /decommission/mixed-versions
See this test on roachdash |
@irfansharif it looks like it might be connected to #56480 (or some of its preceding changes). However, the timing of these failures doesn't entirely line up with when that change landed. I ran a couple iterations of this on master but could not reproduce. I am running a few iterations of this on that change's SHA, will post here again. LMK if you already know something about this |
|
Thanks for looking, Irfan. |
I've been starting at it this morning, and I'm pretty sure it's another manifestation of #58378. In #58378 (comment) we observed a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized", which happens between the following two points (notice the locking, or the lack there of, between these points):
cockroach/pkg/kv/kvserver/replica_raftstorage.go Lines 1001 to 1005 in ce4b23d
What this means is that very temporarily, it's possible for the entry in cockroach/pkg/kv/kvserver/replica.go Lines 801 to 806 in 690c122
Which is being driven by the migrations infrastructure here: cockroach/pkg/kv/kvserver/store.go Lines 2808 to 2809 in 690c122
Looking at our replica iterator, we filter out uninitialized replicas here: cockroach/pkg/kv/kvserver/store.go Line 396 in 690c122
But because it's still possible for an "initialized" replica to exist in our store map without a real ReplicaState, we're likely to run into the panic above. Seeing as how @tbg is already planning to tackle #58378, I'll punt it off. |
There's a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this means is that very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This was an existing problem, but is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at start up time (when replicas are being initialized, and we're iterating through extan replicas in the Store.mu.replicas map). This issue has caused a bit of recent instability: cockroachdb#59180, cockroachdb#58489, \cockroachdb#58523, and cockroachdb#58378. While we work on a more considered fix to the problem (tracked in cockroachdb#58489), we can introduce stop the bleeding in the interim (and unskip some tests). Release note: None
There's a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this means is that very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This was an existing problem, but is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at start up time (when replicas are being initialized, and we're iterating through extan replicas in the Store.mu.replicas map). This issue has caused a bit of recent instability: cockroachdb#59180, cockroachdb#58489, \cockroachdb#58523, and cockroachdb#58378. While we work on a more considered fix to the problem (tracked in cockroachdb#58489), we can introduce stop the bleeding in the interim (and unskip some tests). Release note: None
59194: kv: introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif There's a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this means is that very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This was an existing problem, but is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at start up time (when replicas are being initialized, and we're iterating through extan replicas in the Store.mu.replicas map). This issue has caused a bit of recent instability: #59180, #58489, \#58523, and #58378. While we work on a more considered fix to the problem (tracked in #58489), we can introduce stop the bleeding in the interim (and unskip some tests). Release note: None 59201: sql: add telemetry for materialized views and set schema. r=otan a=RichardJCai sql: add telemetry for materialized views and set schema. Release note: None Resolves #57299 Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com> Co-authored-by: richardjcai <caioftherichard@gmail.com>
this is not a release blocker if you have #59194. |
There's a scary lack of synchronization around how we set the ReplicaState for a given replica, and how we mark a replica as "initialized". What this means is that very temporarily, it's possible for the entry in Store.mu.replicas to be both "initialized" and have an empty ReplicaState. This was an existing problem, but is now more likely to bite us given the migrations infrastructure attempts to purge outdated replicas at start up time (when replicas are being initialized, and we're iterating through extan replicas in the Store.mu.replicas map). This issue has caused a bit of recent instability: cockroachdb#59180, cockroachdb#58489, \cockroachdb#58523, and cockroachdb#58378. While we work on a more considered fix to the problem (tracked in cockroachdb#58489), we can introduce stop the bleeding in the interim (and unskip some tests). Release note: None
At this stage, I think the alpha will definitely include #59194, so removing the release blocker label. |
(roachtest).decommission/mixed-versions failed on master@15765c0fa9118885dda0bd2ad1384b8801c412c3:
More
Artifacts: /decommission/mixed-versions
Related:
See this test on roachdash
powered by pkg/cmd/internal/issues
The text was updated successfully, but these errors were encountered: