-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: more aggressively replica GC learner replicas #41300
storage: more aggressively replica GC learner replicas #41300
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and thanks for looking into this!
Reviewed 4 of 4 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/replica_gc_queue.go, line 39 at r1 (raw file):
// A Replica is suspected to have been removed if either it is in the // candidate Raft state (which is a typical sign of having been removed // from the group) or it is a leaner which has been removed but never heard
learner
This PR fixes a test flake in TestSystemZoneConfig: ``` client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14] r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9] ``` The above flake happens because we set the expectation in the map to a descriptor which contains a learner which has since been removed. We shouldn't use a range descriptor which contains learners as the expectation. To avoid that we return an error in the succeeds soon if we come across a descriptor which contains learners. This behavior unvealed another issue, we are way too conservative with replica GC for learners. Most of the time when learners are removed they hear about their own removal, but if they don't we won't consider the Replica for removal for 10 days! This commit changes the replica gc queue behavior to treat learners line candidates. Fixes cockroachdb#40980. Release Justification: bug fixes and low-risk updates to new functionality. Release note: None
7376180
to
2e9ce1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r=tbg
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/storage/replica_gc_queue.go, line 39 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
learner
Done.
Build failed |
Flaked on TestComposeGSS. bors r+ |
Build failed |
bors r+ |
41300: storage: more aggressively replica GC learner replicas r=ajwerner a=ajwerner This PR fixes a test flake in TestSystemZoneConfig: ``` client_replica_test.go:1753: condition failed to evaluate within 45s: mismatch between r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):7, (n7,s7):5, next=8, gen=14] r1:/{Min-System/NodeLiveness} [(n1,s1):1, (n6,s6):2, (n4,s4):3, (n2,s2):4, (n7,s7):5, (n3,s3):6LEARNER, next=7, gen=9] ``` The above flake happens because we set the expectation in the map to a descriptor which contains a learner which has since been removed. We shouldn't use a range descriptor which contains learners as the expectation. To avoid that we return an error in the succeeds soon if we come across a descriptor which contains learners. This behavior unvealed another issue, we are way too conservative with replica GC for learners. Most of the time when learners are removed they hear about their own removal, but if they don't we won't consider the Replica for removal for 10 days! This commit changes the replica gc queue behavior to treat learners line candidates. Fixes #40980. Release Justification: bug fixes and low-risk updates to new functionality. Release note: None 41308: storage: remove error from Replica.applyTimestampCache() r=ajwerner a=ajwerner Stumbled upon a function with an error in its return signature that never returns an error. Better to remove it and the stale comment that goes with it. The removal of the code paths which could have returned an error occurred in #33396. Release justification: Low risk, does not change logic. Could also hold off. Release note: None Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded |
This PR fixes a test flake in TestSystemZoneConfig:
The above flake happens because we set the expectation in the map to a
descriptor which contains a learner which has since been removed.
We shouldn't use a range descriptor which contains learners as the expectation.
To avoid that we return an error in the succeeds soon if we come across a
descriptor which contains learners. This behavior unvealed another issue,
we are way too conservative with replica GC for learners. Most of the time
when learners are removed they hear about their own removal, but if they don't
we won't consider the Replica for removal for 10 days! This commit changes
the replica gc queue behavior to treat learners line candidates.
Fixes #40980.
Release Justification: bug fixes and low-risk updates to new functionality.
Release note: None